    1. eLife Assessment

      This important study utilizes behavioral data and computational modeling to show that spatial properties of visual attention affect human planning. The methodology and statistical analyses are convincing, though the way attention is conceptualized and modeled could be refined. The findings of this study will interest cognitive scientists studying attention, perception, and decision-making.

    2. Reviewer #1 (Public review):

      Summary:

      This study investigated how visuospatial attention influences the way people build simplified mental representations to support planning and decision-making. Using computational modeling and virtual maze navigation, the authors examined whether spatial proximity and the spatial arrangement of obstacles determine which elements are included in participants' internal models of a task. The study developed and tested an extension of the value-guided construal (VGC) model that incorporates features of spatial attention for selecting simpler mental representations of the task.

      Strengths:

      (1) Original Perspective: The study introduces an explicit attentional component to established models of planning, offering an approach that bridges perception, attention, and decision-making.

      (2) Methodological Approach: The combination of computational modeling, behavioral data, and eye-tracking provides converging measures to assess the relationship between attention and planning representations.

      (3) Cross-validated data: The study relies on the analysis of three separate datasets, two already published and an additional novel one. This allows for cross-validation of the findings and enhances the robustness of the evidence.

      (4) Focus on Individual Differences: Reports of how individual variability in attentional "spillover" correlates with the sparsity of task representations and spatial proximity add depth to the analysis.

      Appraisal of Aims and Results:

      The study sets out to determine how spatial attention shapes the construction of task representations in planning contexts. The authors provide evidence that spatial proximity and arrangement influence which environmental features are incorporated into internal models used for navigation, and that accounting for these effects improves model predictions. There is clear documentation of individual variation, with some participants showing greater attentional spillover and more sparse awareness profiles.

      Comments on revised version:

      The authors did a great job and I am very happy with the revised manuscript.

    3. Reviewer #2 (Public review):

      Summary:

      Castanheira et al. investigate the role of spatial attention for planning during three maze navigation experiments (one new experiment and two existing datasets). Effective planning in complex situations requires the construction of simplified representations of the task at hand. The authors find that these mental representations (as assessed by conscious awareness) of a given stimulus are influenced by (spatially) surrounding stimuli. Individual participants varied in the degree to which attention influenced their task representations, and this attentional effect correlated with the sparsity of representations (as measured by the range of awareness reports across all stimuli). Spatially grouping task-relevant information on either the left or right side of the maze led to mental representations more similar to optimal representations predicted by the value-guided construal (VGC) model - a normative model describing a theoretical approach to simplifying complex task information. Finally, the authors propose an update to this model, incorporating an attentional spotlight component; the revised descriptive model predicts empirical task representations better than the original (normative) VGC model.

      Strengths:

      The novelty of this study lies in the proposal and investigation of a cognitive mechanism through which a normative model like value-guided construal can enable human planning. After proposing attention as this mechanism, the authors make concrete hypotheses about mismatches between the VGC predictions and real human behavior, which are experimentally validated. Thus, not only does this study describe a possible mechanism for simplification of task information for planning, but the authors also propose a descriptive model, revising VGC to incorporate this attentional component.

      A strength of this paper is the variety of investigative approaches: analysis of existing data, novel experiment, and a computational approach to predict experimental findings from a theoretical model. Analyzing pre-existing datasets increases the size of the participant cohort and strengthens the authors' conclusions. Meanwhile, comparing the predictions of the existing normative model and the authors' own refined model is a clever approach to substantiate their claims. In addition, the authors describe several crucial controls, which are key to the interpretability of their results. In particular, the eye tracking results were critical.

      In summary, this paper constitutes an important step toward a more complete understanding of the human ability to plan.

      Comments on revised version:

      I am overall happy with the revision and agree that the authors have addressed most of the comments.

    4. Reviewer #3 (Public review):

      Summary:

      The authors build on a recent computational model of planning, the "value-guided construal" framework by Ho et al. (2022), which proposes that people plan by constructing simple models of a task, such as by attending to a subset of obstacles in a maze. They analyze both published experimental data and new experimental data from a task in which participants report attention to objects in mazes. The authors find that attention to objects is affected by spatial proximity to other objects (i.e., attentional overspill) as well as whether relevant objects are lateralized to the same hemifield. To account for these results, the authors propose a "spotlight-VGC" model, in which, after calculating attention scores based on the original VGC model, attention to objects is enhanced based on distance. They find that this model better explains participant responses when objects are lateralized to different hemifields. These results demonstrate complex interactions between filtering of task-relevant information and more classical signatures of attentional selection.

      Strengths:

      (1) The paper builds on existing modeling work in a novel manner and integrates classic results on attention into the computational framework.

      (2) The authors report new and extensive analyses of existing data that shed light on additional sources of systematic variability in responses related to attentional spillover effects.

      (3) They collect new data using new stimuli in the original paradigm that directly test predictions related to the lateralization of task-relevant information, including eye tracking data that allows them to control for possible confounds.

      (4) The extended model (spotlight-VGC) provides a formal account of these new results.

      Comments on revised version:

      I also agree that the authors addressed our comments and the manuscript is much stronger now.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study investigated how visuospatial attention influences the way people build simplified mental representations to support planning and decision-making. Using computational modeling and virtual maze navigation, the authors examined whether spatial proximity and the spatial arrangement of obstacles determine which elements are included in participants' internal models of a task. The study developed and tested an extension of the value-guided construal (VGC) model that incorporates features of spatial attention for selecting simpler mental representations of the task.

      Strengths:

      (1) Original Perspective:

      The study introduces an explicit attentional component to established models of planning, offering an approach that bridges perception, attention, and decision-making.

      (2) Methodological Approach:

      The combination of computational modeling, behavioral data, and eye-tracking provides converging measures to assess the relationship between attention and planning representations.

      (3) Cross-validated data:

      The study relies on the analysis of three separate datasets, two already published and an additional novel one. This allows for cross-validation of the findings and enhances the robustness of the evidence.

      (4) Focus on Individual Differences:

      Reports of how individual variability in attentional "spillover" correlates with the sparsity of task representations and spatial proximity add depth to the analysis.

      We thank the Reviewer for their overall positive assessment of our work and their helpful comments. We have addressed each point below.

      Weaknesses:

      (1) Clarity of the VGC model and behavioral task:

      The exposition of the VGC model lacks sufficient detail for non-expert readers. It is not clear how this model infers which maze obstacles are relevant or irrelevant for planning, nor how the maze tasks specifically operationalize "planning" versus other cognitive processes.

      The method for classifying obstacles as relevant or irrelevant to the task and connecting metacognitive awareness (i.e., participants' reports of noticing obstacles) to attentional capture is not well justified. The rationale for why awareness serves as a valid attention proxy, as opposed to behavioral or neurophysiological markers, should be clearer.

      We thank the reviewer for urging further clarity here. Our work builds closely on the previous maze navigation paradigm and VGC model developed and reported by Ho et al. Nature (2022). We directly adopted variants of their maze stimuli, computational model and obstacle awareness measures, and married these with an investigation of the role of visuospatial attention. We agree that it would be useful for the reader to have a more in-depth description of the paradigm and model, and how it operationalises planning, without needing to refer back to the original Ho et al. paper. We have now added additional explanatory sections to the Introduction and Methods as follows:

      On page 4:

      “One elegant approach to forming such a simplified representation is to adaptively select the granularity of information required to complete the task (Ho et al., 2022), known as value-guided construal (VGC). Unlike previous accounts, which model human planning as a search over all items (e.g., tube lines), the VGC model predicts that a cognitively limited decision-maker selects a manageable subset of information over which to plan (i.e., a task representation), balancing utility and complexity (Ho et al., 2022). In our example, the VGC algorithm would plan over a few relevant tube lines rather than planning over all possible stations. To select the representation that achieves the best balance between utility and complexity, the model searches across all possible combinations of tube lines, computing the value (i.e., the plan’s utility minus its cost) of each representation for planning a specific journey. The algorithm then selects the representation with the highest value, which ensures that an ideal observer selects a representation which only includes the items (i.e., tube lines) that lead to successful planning while excluding as many items as possible to keep the plan as simple as possible. For our purposes, items included in the representation are considered task-relevant, while items that are not represented are considered task-irrelevant. This algorithm, therefore, provides a normative standard of an efficient plan to which we can compare people’s actual plans.”
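
      The selection step described in this passage (search all combinations of items, score each by utility minus cost, keep the best) can be illustrated with a toy sketch. This is not the implementation of Ho et al. (2022): the function name `best_construal`, the linear per-item cost, and the tube-line utility are all assumptions made purely for illustration.

```python
from itertools import combinations


def best_construal(items, utility, cost_per_item=1.0):
    """Exhaustively search all subsets of `items` and return the one with the
    highest value, where value = utility(subset) - cost_per_item * len(subset).

    The linear cost is an illustrative assumption, not the published cost term.
    """
    best_subset, best_value = frozenset(), float("-inf")
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            subset = frozenset(subset)
            value = utility(subset) - cost_per_item * len(subset)
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value


def toy_utility(subset):
    """Hypothetical utility: the journey can only be planned successfully
    if both the Central and Victoria lines are represented."""
    return 10.0 if {"Central", "Victoria"} <= subset else 0.0


lines = ["Central", "Victoria", "Jubilee", "Northern"]
construal, value = best_construal(lines, toy_utility)
print(sorted(construal), value)
```

      In this toy case the two lines needed for the journey are kept (task-relevant) and the remaining lines are dropped (task-irrelevant), because adding them raises the cost without raising the utility.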

      On page 6:

      “We operationalized planning using a maze navigation paradigm, akin to our tube-related example, where participants were required to plan a route through the maze, avoiding obstacles that blocked their path. Obstacles predicted by the sVGC model to be included in the representation were considered task-relevant.”

      “At the end of every trial, participants reported their awareness of specific obstacles (see Methods for details). The level of awareness reported for different obstacles provides a read-out of what features of the environment individuals were subjectively representing while solving a particular maze. While other markers of attention and awareness (for instance, behavioural or neurophysiological variables) could also be used, here we focused on direct awareness reports in order to relate our findings both to those of Ho and colleagues and to the subjective awareness reports used in consciousness science (e.g. the Perceptual Awareness Scale (Barnett et al., 2024; Overgaard & Sandberg, 2021; Ramsøy & Overgaard, 2004; Samaha et al., 2015)). Participants were instructed to maintain central fixation while planning (see dataset dSC 1), in line with previous empirical work using this task (Ho et al., 2022).”

      To visualize our effects, we binarized the predictions of the sVGC model such that obstacles with a marginalized probability greater than 0.5 were considered task-relevant, while other obstacles were considered task-irrelevant (e.g., Figure 2b). We now clarify this point in the caption of Figure 2.
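
      As a minimal sketch of this binarization step, assuming the model predictions are available as one probability per obstacle (the obstacle names and values below are invented for illustration):

```python
def binarize_relevance(probabilities, threshold=0.5):
    """Label an obstacle task-relevant (True) when its marginalized
    probability is strictly greater than the threshold."""
    return {name: p > threshold for name, p in probabilities.items()}


# Hypothetical sVGC probabilities for three obstacles in one maze.
example = {"obstacle_0": 0.92, "obstacle_1": 0.50, "obstacle_2": 0.07}
labels = binarize_relevance(example)
print(labels)
```

      Because the comparison is strict, an obstacle with a probability of exactly 0.5 falls in the task-irrelevant group, matching the "greater than 0.5" wording above.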

      (2) Attention framework:

      The account of attention is largely limited to the "spotlight" model. When solving a maze, participants trace the correct trail, following it mentally with their overt or covert attention. In this perspective, relevant concepts are also rooted in attention literature pertaining to object-based attention using tasks like curve tracing (e.g., Pooresmaeili & Roelfsema, 2014) and to mental maze solving (e.g., Wong & Scholl, 2024), which may be highly relevant and add nuance to the current work. This view of attention may be more pertinent to the task than models of simultaneously tracking multiple objects cited here. Prior work (notably from the Roelfsema group) indicates that attentional engagement in curve-tracing tasks may be a continuous, bottom-up process that progressively spreads along a trajectory, in time and space, rather than a "spotlight" that simply travels along the path. The spread of attention depends on the spatial proximity to distractors - a point that could also be pertinent to the findings here.

      Moreover, the tracing of a "solution" trail in a maze may be spontaneous and not only a top-down voluntary operation (Wong & Scholl, 2024), a finding that requires a more careful framing of the link to conscious perception discussed in the manuscript.

      Conceptualizing attention as a spatial spotlight may therefore oversimplify its role in navigation and planning. Perhaps the observed attentional modulation reflects a perceptual stage of building the trail in the maze rather than a filter for a later representation for more efficient decision making and planning. A fuller discussion of whether the current model and data can distinguish between these frameworks would benefit readers.

      We thank the reviewer for highlighting relevant findings in the attention literature that were missing from our discussion. We fully agree that a complete account of the interplay of planning, navigation, and attention is likely to recruit the kind of curve-tracing processes highlighted by the reviewer. However, we emphasise that our current focus is not on the process of navigation through a maze, but on the process of construing the maze itself. In other words, we are focused not on how people represent their path from A to B, but how they represent the maze itself, which they then use as a basis for planning between A and B. The VGC model predicts that a subset of obstacles will be included in this construal. We think that a spotlight model is a good starting point for this work, because attention is being deployed across the whole maze stimulus, and then becomes attached to particular objects located in particular positions. This is a distinct process from that involved in navigating the path itself. Accordingly, our stimuli were designed such that task-relevant obstacles could be presented either proximally or distally to the optimal path (e.g., Figure 1a and Supplemental Figures S1-6). An obstacle that blocks any possible path on one side of the maze is task-relevant but located a long way from the optimal path. The results of Ho and colleagues’ (2022) third experiment demonstrate how task-relevant yet distal obstacles are better remembered than task-irrelevant proximal obstacles (see Figure 4 of Ho et al., 2022). We also observed that obstacles further away from the navigation path were often represented by participants (see Figures S1-6), which cannot be explained by curve tracing alone.

      While these results cannot definitively rule out the possibility that participants automatically trace the path while also construing the maze, they suggest that the value-guided construal process is an independent predictor of participants’ representations beyond proximity to the navigated path. To make this distinction clearer, we now cite the papers alluded to by the reviewer, in the Discussion on pages 28-29, while also acknowledging the potential for investigating attention during the navigation process itself:

      “Future work may also wish to examine the relevance of visuospatial attention for the navigation process itself in this task. While our present findings speak to how individuals perceive the maze while planning, it remains unclear how attention is deployed during navigation along a path, such as how object-based attention progressively spreads along trajectories in time and space (Pooresmaeili & Roelfsema, 2014; Wong & Scholl, 2024).”

      There is also one additional nuance to the current spotlight model that we were inspired to consider by the reviewer’s comment. This is the idea that attentional effects may spread within or along the obstacles themselves. We cannot explore this in the current data because we asked for awareness of the entire obstacles, not parts of obstacles, but it may be possible to explore this in future work, for instance, with eye tracking measures.

      More generally, the growth-cone (i.e., zoom lens) model of attention for curve tracing proposed by Roelfsema and colleagues shares considerable similarities with the spotlight of attention model. Both models argue for the grouping of spatially proximal items based on attention. While the growth-cone model argues for varying sizes of zoom lenses (i.e., receptive fields of neurons) that facilitate the tracing of proximal items, both models predict that spatially proximal items are preferentially processed together because of attention. Indeed, the spotlight model could model these varying zoom lenses by altering the width of the attentional spotlight dynamically across the visual scene based on the spatial proximity of obstacles. Following related comments by Reviewer 2, we now investigate inter-individual differences in the attentional spotlight of participants and observed that these differences significantly predict participants’ mental representations (see Attentional spotlight model of task representations). We have now updated the Discussion to include consideration of these alternative model frameworks:

      On page 27:

      “Second, in the current work we were unable to distinguish whether these attentional effects are driven by a fixed spotlight of attention, or whether attention operates akin to a zoom lens, shifting the ‘width’ of the focus of attention according to the task demands (Eriksen & St. James, 1986; Müller et al., 2003; Schad & Engbert, 2012). The latter view would be consistent with growth-cone models of attention in which the focus of attention expands and contracts in accordance with task demands, mirroring the various receptive field sizes in the visual hierarchy (Pooresmaeili et al., 2014; Pooresmaeili & Roelfsema, 2014). In partial support of this idea, we found significant inter-individual differences in the width of participants’ attentional spotlight (Figure S11). It is also possible that attention is deployed within or along parts of obstacles, rather than on entire obstacles. Future work using naturalistic measures of eye movements may be able to address these questions.”
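
      To make the role of spotlight width concrete, the following is a hypothetical sketch of attentional spillover, in which each obstacle inherits part of the VGC score of its neighbours through a Gaussian falloff with distance. The Gaussian kernel, the max-combination rule, and all names are assumptions for illustration and are not taken from the spotlight-VGC model as published.

```python
import math


def spotlight_scores(vgc_scores, positions, width):
    """Spread each obstacle's VGC score to its neighbours.

    Each obstacle keeps the larger of (a) its own VGC score and (b) the
    strongest distance-discounted score spilling over from another obstacle.
    `width` (> 0) is the standard deviation of the assumed Gaussian spotlight.
    """
    boosted = []
    for i, (xi, yi) in enumerate(positions):
        spill = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2
            weight = math.exp(-dist_sq / (2 * width ** 2))
            spill = max(spill, vgc_scores[j] * weight)
        boosted.append(max(vgc_scores[i], spill))
    return boosted


# One task-relevant obstacle at the origin, one near neighbour, one far away.
scores = [1.0, 0.0, 0.0]
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
narrow = spotlight_scores(scores, positions, width=1.0)
wide = spotlight_scores(scores, positions, width=5.0)
print(narrow, wide)
```

      With a narrow spotlight only the near neighbour receives appreciable spillover, whereas a wide spotlight also boosts the distant obstacle. A zoom-lens account would correspond to letting `width` vary with task demands rather than fixing it.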

      (3) Lateralization of attention:

      The analysis considers whether relevant information is distributed bilaterally or unilaterally across the visual display, but does not sufficiently address evidence for attentional asymmetries across the left and right visual fields due to hemispheric specialization (e.g., Bartolomeo & Seidel Malkinson, 2019). Whether effects differ for left versus right hemifield arrangements is not made explicit in the presented findings.

      We thank the reviewer for this suggestion. To address this point, we fitted a three-way interaction model between VGC model prediction, lateralization index, and side (left vs right hemifield). We did not find evidence for the three-way effect (β = 0.01, SE = 0.02, 95% CI [-0.03, 0.04], p = 0.738; ΔBIC = 58.30 in favour of the null effect; see table below), suggesting that the side to which participants lateralized their attention did not influence their task representations. This result is now reported on page 12:

      “This effect did not vary significantly as a function of the specific hemifield (i.e., left vs right) in which task-relevant information was presented (β = 0.01, SE = 0.02, 95% CI [-0.03, 0.04], p = 0.738; ΔBIC = 58.30 in favour of the null effect; see table S14).”

      We also explored inter-individual differences in participants’ tendency to lateralize their attention (see also the next point). We observed that participants tended to lateralize their attention slightly more to the right-hand side for non-lateralized maze stimuli, despite the normative sVGC model predicting that participants should not lateralize their attention for these stimuli (Figure 3c). These results may speak to potential asymmetries in lateralization, but given the exploratory nature of these analyses, they should be verified and replicated in future work.

      (4) Individual differences:

      Individual differences in attentional modulation are a strength of the work, but similar analyses exploring individual variation in lateralization effects could provide further insight, and the lack of such analyses may mask important effects.

      Thank you for this suggestion. In new analyses, we explored whether i) participants exhibited differences in their tendency to lateralize their awareness reports, and ii) whether the degree to which they tended to lateralize their awareness predicted their performance on a separate set of maze stimuli. In short, we observed substantial variation in participants’ tendency to lateralize their awareness (Figure S11) and found that this tendency reflected an inter-individual difference which was stable across maze types. We report these new findings on pages 14-16.

      “Inter-individual variation in lateralization of attention

      “Next, we investigated participants’ tendency to pay attention to obstacles within a single hemifield (left vs right) regardless of the sVGC model predictions. To do so, we computed an awareness lateralization index (ALI) based on participants’ awareness reports of obstacles on each trial (Figure 3a). Large positive values indicate that participants were preferentially aware of the right hemifield, whereas negative values indicate preferential awareness of the left hemifield. Values close to zero indicate that participants paid attention to both hemifields equally (see Methods for details). We observed that participants’ tendency to lateralize their awareness varied greatly across the Ho datasets 1 and 2 (Figure 3b); some participants preferentially paid attention to a single hemifield, regardless of whether the sVGC model predictions were lateralized. For the dSC1 dataset, we observed that on some trials, participants significantly lateralized their awareness (|ALI| > 0.5; Figure 3c) even though the sVGC model predictions were non-lateralized. These findings suggest that participants’ tendency to pay attention to a single hemifield may represent an observable inter-individual difference in how they allocate their awareness to form task construals.”
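
      The exact definition of the ALI is given in the Methods and is not reproduced in this response. Purely as an illustration of an index with the properties described above (positive for right-hemifield awareness, negative for left, near zero when balanced), one might assume a normalized right-minus-left contrast. The formula, function name, and example values below are assumptions, not the authors' definition.

```python
def awareness_lateralization_index(awareness, x_positions, midline=0.0):
    """Assumed ALI: (right - left) / (right + left), bounded in [-1, 1].

    `awareness` holds non-negative awareness ratings, one per obstacle;
    `x_positions` holds each obstacle's horizontal position. Obstacles lying
    exactly on the midline are ignored in this sketch.
    """
    right = sum(a for a, x in zip(awareness, x_positions) if x > midline)
    left = sum(a for a, x in zip(awareness, x_positions) if x < midline)
    total = right + left
    if total == 0:
        return 0.0  # no awareness reported on either side
    return (right - left) / total


# Hypothetical trial: two well-noticed obstacles on the right, one barely noticed on the left.
ali = awareness_lateralization_index([0.9, 0.8, 0.1], [2.0, 3.0, -2.0])
print(round(ali, 3), abs(ali) > 0.5)
```

      Under this assumed definition, the |ALI| > 0.5 criterion mentioned in the passage would flag trials on which at least three quarters of the summed awareness fell in one hemifield.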

      “To further explore these inter-individual differences, we tested whether participants’ tendency to lateralize their attention to a single hemifield was consistent across trials and maze stimuli. We observed that participants’ tendency to lateralize their attention to a single hemifield was similar for left and right lateralized maze stimuli (Spearman ρ = 0.72, Figure 3d). This suggests that participants who preferentially attended to a single hemifield did so regardless of which hemifield they should attend to. More consequentially, the tendency for participants to lateralize their awareness on maze stimuli whose model predictions were also lateralized correlated with participants’ tendency to lateralize their attention on non-lateralized maze stimuli (Spearman ρ = 0.88, Figure 3d). Taken together, these findings emphasize that some individuals tend to preferentially attend to a single hemifield when planning. This tendency, importantly, represents an inter-individual difference in how participants allocate their attention across various maze types.”

      (5) Distinction between overt and covert attention:

      The current report at times equates eye movement patterns with the locus of attention. However, attention can be covertly shifted without corresponding gaze changes (see, for example, Pooresmaeili & Roelfsema, 2014).

      We fully agree, and thank the reviewer for prompting further reflection on this distinction. In the online experiments run by Ho and colleagues (i.e., datasets Ho1 and Ho2), participants’ eye movements were not tracked, so it cannot be determined whether participants sampled maze obstacles using covert or overt attention. In our third experiment (i.e., dataset dSC1), we both recorded eye movements and explicitly instructed participants to fixate centrally while viewing the maze. This ensured that participants oriented their attention only covertly during planning (see Figure S13-14).

      We now elaborate on this important distinction in the Results section of the manuscript, page 12:

      “In addition, we monitored participants’ eye movements in dataset dSC 1 to ensure that attention shifts would be covert as opposed to overt—a distinction which could not be determined in the online samples of datasets Ho 1 and 2.”

      On page 28:

      “Importantly, while the visuospatial attention effects observed in the Ho 1 and 2 datasets are likely driven by both covert and overt shifts in attention, the findings presented in experiment 3 (i.e., dSC1 dataset) rule out the contribution of overt shifts in attention through the use of eye tracking (see Figure S13-14) (Carrasco, 2011; Pooresmaeili & Roelfsema, 2014).”

      The implications for interpreting the relationship between eye movement, memory, and attention in this setting are not fully addressed. The potential dynamics of attention along a maze trajectory and their impact on lateralization analysis would benefit from further clarification.

      We thank the reviewer for urging more clarity here. The attentional dynamics we document in our study concern how people perceive / construe the maze itself, rather than how they deploy their attention to guide active navigation. We have now sought to make this distinction clear at a number of points in the paper. The core idea is that attention acts as an early filter to select which obstacles are part of a task construal, which then affects both awareness and memory.

      We have now clarified the focus of our study in the introduction on pages 5-7:

      “Our focus in this study was to examine how participants perceive and represent their environment (the maze stimulus). This is a distinct process to how participants orient their attention during navigation itself, which is not part of our current study. To do so, we harness classical signatures of attentional selection to characterise how visuospatial attention shapes awareness of maze obstacles during planning.” … “Our focus in the present study was to examine attentional effects on participants’ perception of the maze stimulus. We did not quantify how individuals deploy their attention in the phase in which they were navigating through the maze.”

      We did not explicitly test for memory effects in our new experiments, but Ho and colleagues demonstrated that the sVGC model predicted not only awareness reports, but also participants’ memory of obstacles (see Ho et al., 2022). Indeed, task representations computed from memory or awareness reports were strikingly similar in their experiments (Spearman ρ = 0.86 between memory accuracy and awareness; ρ = 0.86 between confidence in memory and awareness). In relation to eye movements, we refer the reviewer back to our previous response, which details how eye movements were measured and controlled during maze construal.

      Figure 1 legend (b) --> (c)

      We have corrected this typo in the figure caption.

      Reviewer #2 (Public review):

      Summary:

      Castanheira et al. investigate the role of spatial attention for planning during three maze navigation experiments (one new experiment and two existing datasets). Effective planning in complex situations requires the construction of simplified representations of the task at hand. The authors find that these mental representations (as assessed by conscious awareness) of a given stimulus are influenced by (spatially) surrounding stimuli. Individual participants varied in the degree to which attention influenced their task representations, and this attentional effect correlated with the sparsity of representations (as measured by the range of awareness reports across all stimuli). Spatially grouping task-relevant information on either the left or right side of the maze led to mental representations more similar to optimal representations predicted by the value-guided construal (VGC) model - a normative model describing a theoretical approach to simplifying complex task information. Finally, the authors propose an update to this model, incorporating an attentional spotlight component; the revised descriptive model predicts empirical task representations better than the original (normative) VGC model.

      Strengths:

      The novelty of this study lies in the proposal and investigation of a cognitive mechanism through which a normative model like value-guided construal can enable human planning. After proposing attention as this mechanism, the authors make concrete hypotheses about mismatches between the VGC predictions and real human behavior, which are experimentally validated. Thus, not only does this study describe a possible mechanism for simplification of task information for planning, but the authors also propose a descriptive model, revising VGC to incorporate this attentional component.

      A strength of this paper is the variety of investigative approaches: analysis of existing data, novel experiment, and a computational approach to predict experimental findings from a theoretical model. Analyzing pre-existing datasets increases the size of the participant cohort and strengthens the authors' conclusions. Meanwhile, comparing the predictions of the existing normative model and the authors' own refined model is a clever approach to substantiate their claims. In addition, the authors describe several crucial controls, which are key to the interpretability of their results. In particular, the eye tracking results were critical.

      In summary, this paper constitutes an important step toward a more complete understanding of the human ability to plan.

      We thank the Reviewer for their thoughtful and positive assessment of our findings. We also appreciate the constructive feedback on our methodology, which we believe has substantially improved our manuscript.

      Weaknesses:

      (1) There is a critical conceptual gap in the study and its interpretation, mainly due to the reliance on a self-report metric of awareness (rather than an objective measure of behavioral performance).

      a. Awareness is tested by a 9-point self-report scale. It is currently unclear why awareness of task-irrelevant obstacles in this task would necessarily compromise optimal planning. There is no indication of whether self-reported awareness affects performance (e.g., navigation path distance, time to complete the maze, number of errors). Such behavioral evidence of planning would be more compelling.

      We thank the reviewer for prompting further reflection on the connection between construal and navigation performance. We wish to emphasise that the primary focus of our study was on measuring and modeling participants’ task construals using perceptual awareness judgments, building on the methods developed by Ho and colleagues, rather than on navigation performance itself. However, as the reviewer points out, there is a natural relationship between construal and performance – if you represent the wrong obstacles, plans may be disrupted.

      To explore the relationship between task construals and performance on the navigation task we first regressed out the effects of the sVGC model on participants’ awareness reports and computed the mean squared residuals for each trial. We then used these values to predict participants’ navigation response times on each trial. We observed a significant negative relationship, suggesting that on trials where participants’ representations showed greater deviations from the normative model, they were in fact faster at navigating the mazes. This relationship was surprising, and at odds with the initial idea that adhering to normative VGC aids in task performance. However, we think that this direction of effect may make sense if one considers that a large part of the actual construal (rather than the normative prediction) in our data was in fact driven by effects such as lateralisation which are not accounted for by the sVGC model. If one is faster at harnessing inductive biases such as lateralisation, then one may be faster to complete the maze but also show a greater deviation from the predictions of the original model.
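
      The two-step analysis described above can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: it uses a plain least-squares fit where the manuscript reports hierarchical models, and the function names, data layout, and toy values are all hypothetical.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_residual_per_trial(trial_ids, model_pred, awareness):
    """Regress awareness on model predictions, then average the
    squared residuals over the obstacles of each trial."""
    intercept, slope = fit_line(model_pred, awareness)
    squared = defaultdict(list)
    for t, p, a in zip(trial_ids, model_pred, awareness):
        squared[t].append((a - (intercept + slope * p)) ** 2)
    return {t: sum(v) / len(v) for t, v in squared.items()}


# Toy data (hypothetical): three trials with two obstacles each.
trial_ids = [0, 0, 1, 1, 2, 2]
model_pred = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]   # model-predicted relevance
awareness = [0.1, 0.9, 0.5, 0.5, 0.0, 1.0]    # rescaled awareness reports
response_time = {0: 4.0, 1: 3.0, 2: 4.5}      # seconds per trial (made up)

msr = mean_squared_residual_per_trial(trial_ids, model_pred, awareness)
trials = sorted(msr)
_, rt_slope = fit_line([msr[t] for t in trials],
                       [response_time[t] for t in trials])
```

      In this toy example the slope of response time on the per-trial residual is negative, which is the direction of the effect reported in the response; the toy numbers were chosen to show that pattern and carry no evidential weight.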

      To further explore these effects, we next focused on the distinction between lateralised and non-lateralised mazes. Here, we reasoned that the initial phase of lateralised attentional selection would lead to lateralised mazes being easier to navigate than non-lateralised ones. We conducted new analyses to determine whether participants navigated lateralized maze stimuli faster and with fewer moves than maze stimuli with non-lateralized model predictions. As detailed in Methods, we excluded trials in which participants significantly deviated from the optimal number of moves (9 or more moves) and took longer than 20 seconds to solve the maze. In line with our interpretation that attention operates as an inductive bias, participants were faster and deviated less from the optimal path on lateralized compared to non-lateralized mazes.

      We now report these new results on navigation performance on pages 20-21:

      “Maze navigation performance

      The previous analyses focused on participants’ task representations during planning. We next sought to explore links between participants’ task representations and maze navigation performance. Participants performed the maze navigation task near-ceiling: they solved 95% of maze stimuli in under 20 seconds, with minimal deviation from the optimal path (i.e., 9 moves or fewer). Notwithstanding this limited variance in task performance, we explored whether participants’ task construals may have impacted their navigation speed. To do so, we first regressed out the effects of the sVGC model from participants’ awareness reports and used the mean squared residuals for each trial to predict response times (see Methods for details). Surprisingly, we observed a negative relationship between mean squared residual variance and response times (β = -0.31, SE = 0.05, 95% CI [-0.41, -0.21], p < 0.001), indicating that participants were faster on trials where the sVGC model explained less variance in their awareness reports. In other words, trials in which participants deviated more from the sVGC model predictions were solved faster. We note that one reason for this may be the strong influence of the lateralisation effect on navigation performance (see paragraph below), which itself is not part of the sVGC model prediction.”

      “We then explored whether participant performance differed between lateralised and non-lateralised mazes. Here, we reasoned that the initial phase of lateralised attentional selection would lead to lateralised mazes being easier to navigate than non-lateralised ones. Consistent with this hypothesis, participants were faster (β = -0.04, SE = 5.91 × 10⁻³, 95% CI [-0.06, -0.03], p < 0.001) and followed the optimal path more closely (β = -0.59, SE = 0.09, 95% CI [-0.78, -0.40], p < 0.001) when maze stimuli were more lateralized.”

      And in the Discussion section, on page 23:

      “Mental representations and task performance

      We observed that participants were faster and deviated less from the optimal path on maze stimuli that were lateralized. This effect is not predicted by the original sVGC model but dovetails with the interpretation that early visuospatial attention operates as an inductive bias to guide the formation of simplified task representations. Surprisingly, we also observed that participants were faster to navigate mazes on trials where their simplified task representation deviated from the sVGC model prediction. We interpret this seemingly contradictory finding in the following way: there are several factors beyond the sVGC model – including, for instance, maze lateralisation – that predict both construal and performance on the maze navigation task. Further work is needed to understand how inductive biases such as lateralisation shape both construal and performance, and the real-world benefits that such strategies might afford for naturalistic stimuli.”

      b. Relatedly, it would have been more convincing to have an objective measure of awareness, for instance, how the presence or absence of a "task-irrelevant" obstacle affects performance (e.g., change navigation path distance or time to complete the maze), or whether participants can accurately recall the location of obstacles.

      We thank the reviewer for prompting further reflection on the validity and robustness of our awareness measures. We emphasise however that our focus is not (primarily) on maze navigation performance, but on task construal, which as noted in our previous response may come apart from navigation performance for a variety of reasons. Our primary goal is to measure participants’ subjective awareness of the maze as a marker of their idiosyncratic (conscious) mental representation on each trial. In doing so, we build on a rich tradition of measuring subjective awareness in consciousness and perception science (for instance, work using the Perceptual Awareness Scale, or detection judgments). In this sense, we think our awareness scale (following Ho et al.) represents a valid and straightforward way of assessing our target psychological construct. However, we also agree with the Reviewer that convergent evidence from other measures is always valuable. In Ho and colleagues’ original paper, they developed a variant of the maze task where participants had to recall the location of obstacles, as well as rate their awareness (Exp 3) and a variant in which participants could hover their mouse over hidden obstacles in the maze to reveal their location – an online metric of attentional deployment (Exp 4). These data afforded us the opportunity to validate the awareness reports against an objective measure of recall, as suggested by the Reviewer. In reanalysing these data, we observed that the obstacle awareness and memory/hover measures were strikingly correlated within two independent samples of participants (Spearman ⍴ = 0.86 between memory accuracy and awareness; ⍴ = 0.86 between confidence in memory and awareness; ⍴ = 0.76 between the probability of hovering over the obstacle and awareness; ⍴ = 0.65 between the duration of the mouse hovering and awareness). 
      These re-analyses are now reported on page 22 of our manuscript, to highlight the convergent validity of the awareness metric:

      “Finally, we examined the convergent validity of participants’ awareness reports by reanalyzing the memory recall data reported in Ho and colleagues’ experiment (Ho et al., 2022). We reasoned that participants should demonstrate similar task representations regardless of the measure used to probe the construal. In line with this prediction, we observed that the obstacle awareness reports and memory/hover measures were strikingly correlated within three independent samples of participants (Spearman ⍴ = 0.86 between memory accuracy and awareness; ⍴ = 0.86 between confidence in memory and awareness; ⍴ = 0.76 between the probability of hovering over the obstacle and awareness; ⍴ = 0.65 between the duration of the mouse hovering and awareness; see Tables S18 and S19).”

      c. Consequently, I'm not sure that we can conclude that the spatial context does impact participants' ability to plan spatial navigation or to "incorporate task-relevant information into their construal". We know that the spatial context affects subjective (self-reported) awareness, but the authors do not present evidence that spatial context affects behavioral performance.

      Following the line of argument above, we think it’s important to separate out task construal (the simplified representation of the maze, measured by awareness reports), and the impact of this on navigation and other aspects of behaviour. The awareness reports (and other convergent measures) show that task-relevant information (as predicted by the VGC) is incorporated into the construal, a process which is modulated by spatial context. These are the key targets of our modeling. Whether this impacts performance is a distinct question, and one that we now address in our response to point a above.

      d. Another concern that may complicate interpretation is the following: Figure 3c shows improved VGC model predictions (steeper slope) for mazes with greater lateralization. However, there are notable outliers in these plots, where a high lateralization index does not correspond to good model performance. There is currently no discussion/explanation of these cases.

      The Reviewer astutely points out some outliers in our analysis. While on average lateralized maze stimuli are represented more closely to the sVGC model, there are indeed some noticeable outlier mazes. These mazes represent stimuli in which participants tended to lateralize their attention to the ‘wrong hemifield’: for example, participants were more aware of obstacles in the right hemifield despite the sVGC model predicting that obstacles in the left hemifield were task-relevant. We believe this explains the poor sVGC model fits on these trials. We note, however, that on average participants were capable of attending to the correct hemifield without explicit instructions (i.e., 9 out of 12 mazes).

      We have now included a discussion of these outliers in the results section of the paper on page 12:

      “We note that for three maze stimuli whose model predictions were lateralized there was nevertheless a poor fit to the sVGC model (see Figure 2c, right panel). These outliers correspond to maze stimuli where participants, on average, lateralized their attention to the incorrect hemifield (i.e., the opposite hemifield to that predicted by the sVGC model).”

      (2) I noticed an issue with clarity regarding task-relevance. It is currently not fully clear which obstacles are "task irrelevant". Also, the term is used inconsistently, sometimes conflating with "awareness". For example, in the "Attentional spotlight model of task representations" section, the authors state that "task-relevant information becomes less relevant when surrounded by task-irrelevant information". But they really mean that participants become less aware of those task-relevant obstacles. I assume task-relevance is an objective characteristic related to maze organization, not to a participant's construal. Indeed, the following paragraph provides evidence of model predictions of awareness.

      We apologize for any confusion regarding the terminology of our manuscript. We indeed use the terms task-relevant and task-irrelevant to refer to obstacles that are objectively predicted by the normative sVGC model or the attentional spotlight model to be included in (>0.5) or excluded from (<0.5) task construals, respectively. This designation reflects the predictions from the computational model and does not reflect participants’ reported awareness. We then ran linear hierarchical models to predict participants’ awareness reports from these model predictions. The Reviewer is correct that the task-relevance of obstacles is indeed related to the maze’s organization, and not related to participants’ subjective reports of awareness. We have now clarified this point throughout the manuscript to better emphasize the difference between the model predictions of task-relevance and participants’ subjective reports.

      On page 17:

      “To achieve this, we computed the predictions of the existing VGC model for each obstacle’s task relevance in a given maze, and averaged these predictions within an attentional spotlight of 3 squares (Figure 4a & S8, see Methods for details). This process yielded novel model predictions, whereby some obstacles which were once predicted as task-irrelevant by the normative sVGC are now predicted as task-relevant by the attentional spotlight model. We depict the effects of this spatial spotlight in Figure 4a: task-irrelevant stimuli (plotted in grey; see middle left obstacle) neighbouring task-relevant obstacles (plotted in orange) become more task-relevant, whereas task-relevant information becomes less relevant when surrounded by task-irrelevant information (see bottom right orange obstacle). This deviation in model predictions from the normative sVGC model was used to predict participants’ awareness reports. We hypothesized that this spotlight-VGC model would predict participants’ reports better than the original VGC model, which does not account for spatial attention.”
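
      The averaging step in the passage above can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: each obstacle is reduced to a single grid position, distance is measured as the larger of the horizontal and vertical offsets, each obstacle counts toward its own average, and the function name and toy layout are hypothetical.

```python
def spotlight_predictions(obstacles, width=3):
    """Average model-predicted relevance over all obstacles that lie
    within `width` squares of each obstacle (the obstacle itself included).

    obstacles: dict mapping a label to ((x, y), relevance_score).
    Returns a dict mapping each label to its spotlight-averaged score.
    """
    smoothed = {}
    for label, ((x, y), _) in obstacles.items():
        nearby = [
            score
            for (ox, oy), score in obstacles.values()
            if max(abs(ox - x), abs(oy - y)) <= width  # assumed distance metric
        ]
        smoothed[label] = sum(nearby) / len(nearby)
    return smoothed


# Toy maze (hypothetical positions and scores).
obstacles = {
    "A": ((1, 1), 1.0),  # relevant
    "B": ((2, 3), 0.0),  # irrelevant, next to A
    "C": ((9, 9), 1.0),  # relevant, surrounded by irrelevant obstacles
    "D": ((8, 7), 0.0),
    "E": ((7, 9), 0.0),
}
smoothed = spotlight_predictions(obstacles)
```

      Under these assumptions, obstacle B rises from 0.0 to 0.5 because it neighbours a relevant obstacle, while obstacle C falls from 1.0 to about 0.33 because its two neighbours are irrelevant, mirroring the two cases the passage describes for Figure 4a.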

      (3) The behavioral paradigm has some distinct disadvantages, and the validity of the task is not backed up by behavioral data.

      a. I understand the need for central fixation, but it also makes the task less naturalistic.

      The fixation cross was required on every trial such that participants could maintain central fixation for our eye tracking experiment. While this design is less naturalistic, it allows us to examine the eye movements of participants. Requiring participants to fixate during the ‘planning’ phase of the experiment allowed us to isolate the effects of covert attention from changes in awareness due to overt shifts in attention. In other words, differences in participants’ awareness reports in the 3rd experiment cannot be explained by longer fixation times to specific obstacles.

      b. The task with its top-down grid view does not seem to mimic real human navigation. Though this grid may be similar to mental maps we form for navigation, the sensory stimuli corresponding to possible paths and to spatial context during real-life navigation are very different.

      We agree with the reviewer that while our task is engaging for participants and simple to follow, it does not mimic naturalistic navigation in humans. There is a natural tension in computational / experimental work in cognitive science in wanting to build closely on previous results and paradigms, while ensuring that results can generalise to real-world contexts. Here, our choice of paradigm and measures was closely built on previous papers using this task from Ho and colleagues (2022, 2023). While preparing this response, we learnt that the MIT group had also harnessed this same task to develop a novel dynamic variant of the VGC model (Chen et al., 2026) called the Just in Time model (JIT). The advantage of building on this prior work is that we are able to iteratively refine and expand the VGC approach, and (in our case) bring it into closer contact with work on modeling the deployment of spatial attention in human vision. The top-down aspect of the maze notably facilitated the study of the spatial deployment of attention. We now discuss the novel dynamic variant of the VGC model in our paper on page 27:

      “We close by reflecting on opportunities for further work in this area. First, an important next step is to explore the process by which task representations are formed, and how inductive biases might affect the process of task construal. The sVGC model is a normative model of the optimal task representation. Since its construction involves an exhaustive calculation over possible paths, it is not a plausible basis for a model of the psychological process by which participants actually construct task representations. More recently a process model of task construal has been proposed, the Just in Time model (JIT). The hypothesis of the JIT model is that participants’ task representations are built up over time by iteratively simulating possible paths through the maze, affording insight into the construal process (Chen et al., 2026). In future work, it would be of interest to ask whether the attentional effects we observe in our experiments could be meshed with a dynamic JIT account of construal. We speculate that visuospatial attention may operate as an early filter, limiting the space of potential construals based on coarse spatial features of the environment, constraining a dynamic selection of obstacles. Brain imaging techniques with high time resolution, such as M/EEG, may be able to shed further light on how task representations are formed as participants plan.”

      c. Behavioral performance is not reported, so it is unknown whether participants are able to properly complete the task. The task seems pretty difficult to navigate, especially when the obstacles disappear, and in combination with the central fixation.

      Behavioural performance is now reported in response to point 1a above.

      d. There is no discussion of whether/how this navigation task generalizes to other forms of planning.

      We fully agree that an important next step would be to generalise our results on construal to naturalistic forms of planning – for instance, using immersive VR mazes, and/or investigating cognitive rather than perceptual construals. We have now added a line to this effect to the Discussion on page 28.

      “An important next step to further our understanding of task representations would be to extend the current paradigm to other forms of planning and more naturalistic tasks, such as navigating immersive virtual reality (VR) environments, planning over cognitive rather than perceptual representations (e.g. planning over an abstract space), or internally guided planning based on working memory.”

      Reviewer #2 (Recommendations for the authors):

      (1) There are, of course, benefits to simple tasks like the ones described, but it would be interesting to compare the results to a possible experiment in which a top-down grid/map is used for planning, but then task execution is carried out in a simulated environment corresponding to the map. Also, perhaps beyond the scope of the questions addressed in this paper, but I am curious how unexpected obstacles affect representations. For instance, if participants plan based on a top-down map and then begin "real" navigation but encounter an unexpected obstacle that was not indicated on the map, does this modulate representations/awareness of future obstacles (near vs. far)?

      We fully agree that all of these lines of investigation would be super interesting to pursue in future studies, and we have added a line to the discussion to that effect on page 28:

      “An important next step to further our understanding of task representations would be to extend the current paradigm to other forms of planning and more naturalistic tasks, such as navigating immersive virtual reality (VR) environments, planning over cognitive rather than perceptual representations (e.g. planning over an abstract space), or internally guided planning based on working memory.”

      (2) Regarding self-reported awareness as a metric, an additional experiment could ask participants to recreate the maze (identify locations of obstacles after they disappear). This would be a more objective measure of awareness.

      Yes indeed, and as described above, this was a metric used by Ho and colleagues in their previous experiment. As we describe in more detail above, the task representations obtained via memory or awareness reports demonstrated striking similarity (⍴ = 0.86).

      (3) What is meant by "all possible orientations of the maze" in this Methods sentence: "For dataset dSC 1, participants solved each of these 24 mazes four times (i.e., all possible orientations of the maze)"?

      We thank the Reviewer for prompting more clarity here. We vertically and horizontally reversed mazes (i.e., left-right flipped) such that participants could not predict the location of the goal or start location. In this way, each maze stimulus had four potential orientations. This resulted in 96 trials of 24 unique mazes. We have clarified this point in the Methods section on page 30:

      Maze stimuli were vertically and horizontally reversed (i.e., left-right flipped) such that participants could not predict the location of the start or goal location. This resulted in four potential orientations of each maze across all 24 mazes, 96 trials in total.
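
      As a small illustration of how two flips yield four orientations per maze, consider the sketch below. The grid encoding (`S` for start, `G` for goal, `#` for an obstacle cell) and the function name are hypothetical and chosen only for this example.

```python
def orientations(maze):
    """Return the four variants of a grid given as a list of row strings:
    original, left-right flipped, up-down flipped, and flipped both ways."""
    flip_lr = [row[::-1] for row in maze]          # mirror each row
    flip_ud = maze[::-1]                           # reverse the row order
    flip_both = [row[::-1] for row in maze[::-1]]  # apply both flips
    return [maze, flip_lr, flip_ud, flip_both]


maze = ["S..",
        ".#.",
        "..G"]
variants = orientations(maze)

# 24 unique mazes, each shown in every orientation.
n_trials = 24 * len(variants)
```

      With 24 unique mazes and four orientations each, `n_trials` comes to 96, matching the trial count stated above.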

      (4) For lateralization, it was unclear until reading the Methods that the lateralization index was calculated using the VGC-predicted level of task-relevance. From the main text and Figure 2, I assumed you were just counting the number of task-relevant obstacles on each side, rather than also quantifying relevance. I understood after reading the Methods, but this could be clarified further.

      We agree with the Reviewer that this was not evident from the text. We have now updated the Results section of the manuscript to clarify this point on page 11:

      “To test this hypothesis, we derived a measure of task-relevant lateralization inspired by the attention literature (Ghafari et al., 2024; Keefe & Störmer, 2021; Vollebregt et al., 2015) (Figure 2a). Specifically, we separated maze stimuli across the vertical meridian and computed the ratio of task-relevant information presented on the left versus right side derived from the sVGC model. For example, the maze shown in Figure 2a has twice the amount of task-relevant information presented in the left hemifield than in the right (lat. Index= 1/3). A lateralization index of 0.0 indicates that both hemifields contain equal amounts of task-relevant information (i.e., non-lateralized). The lateralization index was computed using the continuous VGC predictions for each obstacle (see Methods).”
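
      One formula consistent with the worked example in this passage (twice as much task-relevant information on the left giving an index of 1/3, and equal amounts giving 0.0) is (left − right) / (left + right). The sketch below assumes that form; the exact definition is given in the authors' Methods, and the sign convention, function name, and toy values here are hypothetical.

```python
def lateralization_index(obstacles, midline_x):
    """Compute (left - right) / (left + right) from continuous
    model-predicted relevance scores.

    obstacles: list of (x_position, relevance_score) pairs.
    midline_x: x coordinate of the vertical meridian.
    Positive values mean more relevant information on the left
    (an assumed sign convention).
    """
    left = sum(score for x, score in obstacles if x < midline_x)
    right = sum(score for x, score in obstacles if x > midline_x)
    total = left + right
    return 0.0 if total == 0 else (left - right) / total


# Toy maze: summed relevance of 2.0 on the left and 1.0 on the right.
obstacles = [(1, 1.0), (2, 0.5), (3, 0.5), (8, 0.75), (9, 0.25)]
index = lateralization_index(obstacles, midline_x=5.5)
```

      Because the scores are summed rather than thresholded, two mazes with the same number of obstacles on each side can still differ in their index, which is the distinction the Reviewer asked to have clarified.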

      (5) The explanation in the Methods of how the width of the attentional spotlight was chosen references Figure 1b and Supplementary Figure S2, but it seems that Supplementary Figure S8 explains this more in the caption. Also, I don't see how Figure S2 supports this.

      We apologize for this typo. The explanation of how we selected the width of the attentional spotlight should indeed reference supplemental Figure 15 (previously Figure S8). We have now corrected this and elaborated on this choice in the Methods section on page 35:

      “We fixed the ‘width’ of the attentional spotlight to a distance of 3 squares based on the observation that the two neighbouring obstacles positively predicted the awareness of a probe. We observed that the mean and median distance between neighbouring obstacles of the 2nd rank (i.e., second closest) was 3 squares away for all mazes (Figure S15). We therefore opted to fix the value of the attention spotlight to 3 squares based on these observations. Future work utilizing this model should consider the statistics of their maze stimuli when deciding on the ‘width’ of the attentional spotlight.”

      (6) The attentional spotlight width was assumed to be 3 squares, based on the linear regression predictions of the effect of neighboring obstacles on stimulus awareness. Given the individual differences across participants, it would be interesting to choose a different attentional spotlight size for each participant. Would a participant-specific attentional spotlight width improve the predictions of the spotlight-VGC model?

      The Reviewer highlights a very interesting question: do individuals vary in terms of their attentional spotlight? To test this hypothesis, we first estimated the size of the attentional spotlight for each individual based on lateralized maze stimuli, and then used this to generate personalized attentional spotlight model predictions for each subject based on these values (Figure S11). We restricted this analysis to the dSC1 dataset, where we had substantially more trials (96 in total).

      In brief, we observed that indeed the personalized spotlight model fit participants’ awareness reports better than both a normative sVGC model and a group-level attentional spotlight model. We interpret these findings with some caution as i) a subset of individuals had flat attentional slopes and therefore were excluded from these analyses, and ii) we believe we require additional trials to ensure a robust model fit at the individual level. While our results are encouraging, we hope future investigations into inter-individual differences will extend these findings.

      We have included these additional analyses in the main text.

      On page 18:

      “To further explore inter-individual differences in task construal, we tested whether adjusting the attentional spotlight width to each participant’s awareness reports improved the predictions of the attentional spotlight model. To do so, we first determined the attentional spotlight width of each individual in the dSC1 dataset based on lateralized maze stimuli. We then generated person-specific attentional spotlight model predictions for the non-lateralized maze stimuli to avoid overfitting the data (Figure S11). We note that 7 participants had either flat attentional slopes or negative beta coefficients, which prevented the selection of an appropriate attentional spotlight width (see Methods for details). We observed a significant improvement in model fit for the person-specific attentional spotlight model relative to both the group-level attentional spotlight model (ΔBIC= -1487.39) and the normative sVGC model (ΔBIC= -1655.29). While the limited trial numbers per participant in our current dataset warrant caution in interpreting these findings, they do encourage further research on inter-individual differences in attentional deployment during planning.”

      On pages 23-24:

      “Inter-individual differences in attention

      We also observed considerable inter-individual differences in attentional effects across participants (Figure 1c). While some participants were strongly influenced by the spatial context of neighbouring stimuli, others showed more limited evidence for an attentional effect (Figure 1b). Inter-individual differences in attention predicted the sparsity of participants’ simplified representations: participants with larger attention effects exhibited sparser representations. Moreover, these inter-individual differences in effects of spatial proximity could be incorporated into the attentional spotlight model by varying the width of the spotlight, resulting in better model predictions.”

      “Beyond these spatial proximity effects, we also observed that participants varied in their tendency to lateralize their attention to a single hemifield (Figure 3). This tendency was observed across all three datasets, including on maze stimuli whose value-guided model predictions were not lateralized. This suggests that although a strategy of allocating attention is sub-optimal for these maze stimuli, some individuals preferentially attend to a single hemifield in a heuristic-like fashion. This tendency to attend to a single hemifield was a robust inter-individual difference across maze stimuli (Figure 3d), and dovetails with individual-level variation in spatial proximity effects. Taken together, these findings offer novel insights into how people vary in the ways they allocate spatial attention to solve complex problems. Future research could explore how these individual differences constrain performance on other tasks that require planning and search in high-dimensional spaces.”

      On page 17 of the Supplemental Materials:

      (7) The supplementary text about lateralization effects, above Supplementary Table S8, references Table S6, but Table S6 does not seem to display lateralization results.

      We thank the Reviewer for pointing out this typo: we now refer to the correct supplementary table (S9).

      (8) Why does it matter that "the maze stimuli were not designed to test horizontal-meridian lateralization effects"? What is the effect on power? Is it because there is not a good enough range in lateralization indices? It would be good to clarify, or just remove that explanation, since the cortical retinotopy explanation seems more convincing.

      We did not specifically design the maze stimuli such that there is an equal number of obstacles above and below the horizontal meridian. As such, the lateralization index derived along the horizontal meridian does not control for the number of obstacles in each hemifield, which may influence participants’ awareness reports. In contrast, we designed maze stimuli such that this would not be a concern for the vertical meridian. We have clarified this point in the discussion on page 27.

      “Third, while we observed clear lateralization effects along the vertical meridian (i.e., left vs right hemifield), effects along the horizontal meridian were less clear (i.e., above vs below; see Table S15-16). One potential explanation of this asymmetry is the retinotopic organization of the cortex, in which spatially adjacent stimuli can be retinotopically distant if presented on the opposite side of the vertical (but not horizontal) meridian, facilitating distractor inhibition. Importantly, while the visuospatial attention effects observed in the Ho 1 and 2 datasets are likely driven by both covert and overt shifts in attention, the findings presented in experiment 3 (i.e., dSC1 dataset) rule out the contribution of overt shifts in attention through the use of eye tracking (see Figure S13-14) (Carrasco, 2011; Pooresmaeili & Roelfsema, 2014).”

      (9) For Figure 2c, it would be helpful to directly state what each dot and line mean.

      We updated the caption of Figure 2c to clarify what we are plotting: each point represents an obstacle, and each line the linear fit for a maze stimulus.

      “Each point represents an obstacle in a maze, and each line represents the model fit for that specific maze stimulus.”

      (10) Figures and wording imply there is only a single probe obstacle per trial, but methods and model imply that participants are asked to report awareness for every obstacle. This should be clarified.

      We apologize for any confusion regarding the methodology of our study. The Reviewer is correct that participants reported their awareness of every obstacle presented on a given trial. We have clarified this in the Results section of the manuscript on page 7:

      “Note, participants reported their awareness of every obstacle presented on a given trial.”

      We have also updated the caption of Figure 1 to clarify this point:

      “Once participants finished navigating the maze, they were asked to report their awareness of every obstacle presented on a given trial in a random order.”

      (11) What is the reason for the exclusion of participants (33 for experiment 1 and 26 for experiment 2)?

      Participants were excluded from the Ho et al. datasets 1 and 2 based on their preregistered exclusion criteria, as detailed in the Methods section of their paper. In short, trials were excluded if participants took longer than 20 seconds to complete the trial, or if they spent longer than 5 seconds in the initial state. Participants were excluded if fewer than 80% of their trials remained after reaction time exclusions or if they failed 2 out of 3 comprehension checks. We have elaborated on this point in the Methods section on page 31.

      “Participants were excluded from analyses based on pre-registered exclusion criteria as detailed in (Ho et al., 2022). In short, participants were excluded if 20% or more of their trials were removed based on reaction times, or if they failed 2 out of 3 comprehension checks.”
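
      The exclusion rules described above can be sketched as follows. This is only an illustration of the stated thresholds, not the original preregistered pipeline: the names `Trial`, `keep_trial`, and `keep_participant` are hypothetical, and "failed 2 out of 3 comprehension checks" is assumed to mean two or more failures.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds taken from the response text above.
MAX_TRIAL_SECONDS = 20.0          # trials slower than this are dropped
MAX_INITIAL_STATE_SECONDS = 5.0   # trials with a longer initial pause are dropped
MIN_FRACTION_TRIALS_KEPT = 0.80   # participants need at least this share of trials left
FAILED_CHECKS_FOR_EXCLUSION = 2   # assumption: two or more failed checks (of 3) excludes


@dataclass
class Trial:
    """Hypothetical per-trial record; field names are illustrative."""
    total_seconds: float
    initial_state_seconds: float


def keep_trial(trial: Trial) -> bool:
    """A trial survives if it is within both reaction-time limits."""
    return (
        trial.total_seconds <= MAX_TRIAL_SECONDS
        and trial.initial_state_seconds <= MAX_INITIAL_STATE_SECONDS
    )


def keep_participant(trials: Sequence[Trial], failed_checks: int) -> bool:
    """A participant survives if enough trials remain and the checks were passed."""
    if failed_checks >= FAILED_CHECKS_FOR_EXCLUSION:
        return False
    if not trials:
        return False
    kept = sum(keep_trial(t) for t in trials)
    return kept / len(trials) >= MIN_FRACTION_TRIALS_KEPT
```

      Applying the trial-level filter first and the participant-level filter second mirrors the order in which the response describes the criteria.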

      (12) The supplemental figures are not referenced in order, and some are not referenced at all; this should be fixed.

      We thank the Reviewer for pointing this out and have reorganized our Supplementary materials accordingly.

      Reviewer #3 (Public review):

      Summary:

      The authors build on a recent computational model of planning, the "value-guided construal" framework by Ho et al. (2022), which proposes that people plan by constructing simple models of a task, such as by attending to a subset of obstacles in a maze. They analyze both published experimental data and new experimental data from a task in which participants report attention to objects in mazes. The authors find that attention to objects is affected by spatial proximity to other objects (i.e., attentional overspill) as well as whether relevant objects are lateralized to the same hemifield. To account for these results, the authors propose a "spotlight-VGC" model, in which, after calculating attention scores based on the original VGC model, attention to objects is enhanced based on distance. They find that this model better explains participant responses when objects are lateralized to different hemifields. These results demonstrate complex interactions between filtering of task-relevant information and more classical signatures of attentional selection.

      Strengths:

      (1) The paper builds on existing modeling work in a novel manner and integrates classic results on attention into the computational framework.

      (2) The authors report new and extensive analyses of existing data that shed light on additional sources of systematic variability in responses related to attentional spillover effects.

      (3) They collect new data using new stimuli in the original paradigm that directly test predictions related to the lateralization of task-relevant information, including eye tracking data that allows them to control for possible confounds.

      (4) The extended model (spotlight-VGC) provides a formal account of these new results.

      We thank the Reviewer for their positive assessment of our manuscript and their insightful comments, which have improved the clarity of our findings.

      Weaknesses:

      (1) The spotlight-VGC model has a free parameter - the "width" of the attentional spotlight. This seems to have been fixed to be 3 squares. It would be good if the authors could describe a more principled procedure for selecting the width so that others can use the model in other contexts.

      Our choice for this parameter was informed by the spatial effects reported in Figure 1b. We observed that the two closest neighbouring obstacles to a probe had similar awareness (i.e., positive beta weights). We therefore computed the mean and median distance between each probe and its second-closest obstacle. This distance was 3 squares, as depicted in Figure S15. We fixed the width of the attentional spotlight across all studies based on this observation. We agree that future research utilizing this model may need to tune this hyperparameter depending on the mean distance between a probe and its neighbours.

      We have clarified this point in the methods section on page 35:

      “We fixed the ‘width’ of the attentional spotlight to a distance of 3 squares based on the observation that the two neighbouring obstacles positively predicted the awareness of a probe. We observed that the mean and median distance between neighbouring obstacles of the 2nd rank (i.e., second closest) was 3 squares away for all mazes (Figure S15). We therefore opted to fix the value of the attention spotlight to 3 squares based on these observations. Future work utilizing this model should consider the statistics of their maze stimuli when deciding on the ‘width’ of the attentional spotlight.”
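
      For readers who want to derive a comparable width from their own stimuli, the neighbour-distance statistic described above could be computed roughly as below. This is a sketch under assumptions that the response does not specify: each obstacle is reduced to a single (x, y) centroid in grid squares, distance is Euclidean, and the function names are hypothetical.

```python
import math
from statistics import mean, median
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kth_neighbour_distances(centroids: Sequence[Point], k: int = 2) -> List[float]:
    """For each obstacle, return the distance to its k-th closest other obstacle."""
    distances = []
    for i, probe in enumerate(centroids):
        others = sorted(
            math.dist(probe, other)
            for j, other in enumerate(centroids)
            if j != i
        )
        if len(others) >= k:
            distances.append(others[k - 1])
    return distances


def summarise_spotlight_width(mazes: Sequence[Sequence[Point]], k: int = 2) -> Tuple[float, float]:
    """Pool k-th neighbour distances over all mazes; return (mean, median)."""
    pooled = [d for maze in mazes for d in kth_neighbour_distances(maze, k)]
    if not pooled:
        raise ValueError("need at least one maze with more than k obstacles")
    return mean(pooled), median(pooled)
```

      With `k=2` this corresponds to the "2nd rank" neighbour mentioned in the quoted Methods text; other distance definitions (for example, nearest-edge distance between obstacles) would give different values.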

      Following the suggestion of Reviewer 2 point 6, we now also explored inter-individual differences in this parameter. To do so, we first used the lateralized mazes in the dSC1 dataset to determine the optimal width of the attentional spotlight for each individual.

      Then, we used this spotlight to derive model predictions for each person. We observed that these personalized attentional spotlight model predictions fit participants’ awareness reports on non-lateralized mazes better than the fixed-width spotlight model. We believe this preliminary result suggests the importance of modelling inter-individual differences in attentional deployment during planning. We report these effects on page 17.

      (2) Have the authors considered other ways in which factors such as attentional spillover and lateralization could be incorporated into the model? The spotlight-VGC model, as presented, involves first computing VGC predictions and only afterwards computing spillover. This seems psychologically implausible, since it supposes that the "optimal" representation is first formed and then it gets corrupted. Is there a way to integrate these biases directly into the VGC framework, perhaps as a prior on construals? The authors gesture towards this when they talk about "inductive biases", but this is not formalized.

      We thank the reviewer for bringing up this very important point. We think that a full computational treatment of the inductive bias would be a distinct project, but now seek to expand our discussion on the mechanisms by which representations could be formed. In this context, we specifically highlight novel computational work from the MIT group that was published as a preprint in the time since we submitted our paper, and which proposes a new process account of construal, the “Just in Time” (JIT) model. We also elaborate on a possible mechanism by which visuospatial attention may aid the dynamics of the construal process. In short, we agree with the reviewer that spatial attention may bias individuals to search over a subset of potential representations based on low-level spatial characteristics of the obstacles (e.g., their spatial spread in the visual field), prior to (or in concert with) a dynamic JIT-like selection process. We now elaborate on these possibilities on pages 27-28:

      “We close by reflecting on opportunities for further work in this area. First, an important next step is to explore the process by which task representations are formed, and how inductive biases might affect the process of task construal. The sVGC model is a normative model of the optimal task representation. Since its construction involves an exhaustive calculation over possible paths, it is not a plausible basis for a model of the psychological process by which participants actually construct task representations. More recently, a process model of task construal has been proposed, the Just in Time model (JIT). The hypothesis of the JIT model is that participants’ task representations are built up over time by iteratively simulating possible paths through the maze, affording insight into the construal process (Chen et al., 2026). In future work, it would be of interest to ask whether the attentional effects we observe in our experiments could be meshed with a dynamic JIT account of construal. We speculate that visuospatial attention may operate as an early filter, limiting the space of potential construals based on coarse spatial features of the environment, constraining a dynamic selection of obstacles. Brain imaging techniques with high time resolution, such as M/EEG, may be able to shed further light on how task representations are formed as participants plan.”

      […]

      “Fourth, it will also be necessary to elaborate on how bottom-up and top-down aspects of attentional selection are combined to guide complex task representations and plans. Foundational questions remain unanswered, for instance: can multiple spatial locations be preferentially selected at once, i.e. are there multiple spotlights (Awh & Pashler, 2000; McMains & Somers, 2004; Pylyshyn & Storm, 1988; Shaw & Shaw, 1977)? There is also discourse on how spatial attention may move from one location to another: are the intervening visual regions between attended locations similarly selected (Dubois et al., 2009; Kr & Np, 1999; McMains & Somers, 2004, 2005)? Our findings tentatively suggest that individuals are able to attend to disparate spatial regions to form sparse task representations, yet there is substantial variability in how individuals orient their attention during the task. The present paradigm and computational modelling, in conjunction with carefully designed stimuli, may help resolve these outstanding questions.”

      (3) Can the authors rule out that the lateralization effects are the result of memory biases since the main measure used is a self-report of attention?

      We thank the reviewer for bringing up this important point. In our experiments, we sought to measure participants’ subjective awareness of the maze stimuli as a readout of their conscious task representation on each trial. This approach marries an extensive literature on measures of perceptual awareness in consciousness science (e.g., using the Perceptual Awareness Scale) with computational models of planning. Participants’ memory of (their awareness of) the obstacles is inherent to this approach, but just as with similar approaches in consciousness science (e.g. measures of iconic memory in the Sperling paradigm), we think it provides a reasonably “online” measure of awareness. It’s important of course to ensure that results obtained with awareness reports are not idiosyncratic, and generalise to other approaches to quantifying task representations.

      To further bolster the convergent validity of our awareness measure, we reanalyzed the data from Ho and colleagues. In their original paper, they developed a variant of the maze-navigation task where participants were asked to recall the location of obstacles as well as report their awareness (Exp 3) and a third variant of the task where participants could hover their cursors over hidden obstacles to reveal their locations (Exp 4). These data allowed us to validate the awareness reports against objective measures of recall and mouse-tracking data. We observed that the subjective awareness reports of participants were strikingly correlated with recall/hover measures across two independent samples of participants (Spearman ⍴ = 0.86 between memory accuracy and awareness; ⍴ = 0.86 between confidence in memory and awareness; ⍴ = 0.76 between the probability of hovering over the obstacle and awareness; ⍴ = 0.65 between the duration of the mouse hovering and awareness). We believe these findings validate participants’ awareness reports. These findings are now reported on page 22 of the manuscript.

      “Finally, we examined the convergent validity of participants’ awareness reports by reanalyzing the memory recall data reported in Ho and colleagues’ experiment (Ho et al., 2022). We reasoned that participants should demonstrate similar task representations regardless of the measure used to probe the construal. In line with this prediction, we observed that the obstacle awareness reports and memory/hover measures were strikingly correlated within three independent samples of participants (Spearman ⍴ = 0.86 between memory accuracy and awareness; ⍴ = 0.86 between confidence in memory and awareness; ⍴ = 0.76 between the probability of hovering over the obstacle and awareness; ⍴ = 0.65 between the duration of the mouse hovering and awareness; see Tables S18 and S19).”
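
      As a reminder of what the reported Spearman coefficients measure, a rank correlation between two per-obstacle measures can be computed as sketched below. This is a generic, dependency-free illustration (Pearson correlation of tie-averaged ranks), not the authors' analysis code; the function names are hypothetical and the pairing of measures per obstacle is an assumption.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

      Because the coefficient depends only on rank order, it is insensitive to the different scales of awareness ratings, recall accuracy, and hover durations.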

    6. eLife Assessment

      This important study utilizes behavioral data and computational modeling to show that spatial properties of visual attention affect human planning. The methodology and statistical analyses are solid, though the way attention is conceptualized and modeled could be refined. The findings of this study will interest cognitive scientists studying attention, perception, and decision-making.

    7. Reviewer #1 (Public review):

      Summary: This study investigated how visuospatial attention influences the way people build simplified mental representations to support planning and decision-making. Using computational modeling and virtual maze navigation, the authors examined whether spatial proximity and the spatial arrangement of obstacles determine which elements are included in participants' internal models of a task. The study developed and tested an extension of the value-guided construal (VGC) model that incorporates features of spatial attention for selecting simpler task mental representation.

      Strengths:

      (1) Original Perspective: The study introduces an explicit attentional component to established models of planning, offering an approach that bridges perception, attention, and decision-making.

      (2) Methodological Approach: The combination of computational modeling, behavioral data, and eye-tracking provides converging measures to assess the relationship between attention and planning representations.

      (3) Cross-validated data: The study relies on the analysis of three separate datasets, two already published and an additional novel one. This allows for cross-validation of the findings and enhances the robustness of the evidence.

      (4) Focus on Individual Differences: Reports of how individual variability in attentional "spillover" correlates with the sparsity of task representations and spatial proximity add depth to the analysis.

      Weaknesses:

      (1) Clarity of the VGC model and behavioral task: The exposition of the VGC model lacks sufficient detail for non-expert readers. It is not clear how this model infers which maze obstacles are relevant or irrelevant for planning, nor how the maze tasks specifically operationalize "planning" versus other cognitive processes.

      The method for classifying obstacles as relevant or irrelevant to the task and connecting metacognitive awareness (i.e., participants' reports of noticing obstacles) to attentional capture is not well justified. The rationale for why awareness serves as a valid attention proxy, as opposed to behavioral or neurophysiological markers, should be clearer.

      (2) Attention framework: The account of attention is largely limited to the "spotlight" model. When solving a maze, participants trace the correct trail, following it mentally with their overt or covert attention. In this perspective, relevant concepts are also rooted in attention literature pertaining to object-based attention using tasks like curve tracing (e.g., Pooresmaeili & Roelfsema, 2014) and to mental maze solving (e.g., Wong & Scholl, 2024), which may be highly relevant and add nuance to the current work. This view of attention may be more pertinent to the task than models of simultaneously tracking multiple objects cited here. Prior work (notably from the Roelfsema group) indicates that attentional engagement in curve-tracing tasks may be a continuous, bottom-up process that progressively spreads along a trajectory, in time and space, rather than a "spotlight" that simply travels along the path. The spread of attention depends on the spatial proximity to distractors - a point that could also be pertinent to the findings here.

      Moreover, the tracing of a "solution" trail in a maze may be spontaneous and not only a top-down voluntary operation (Wong & Scholl, 2024), a finding that requires a more careful framing of the link to conscious perception discussed in the manuscript.

      Conceptualizing attention as a spatial spotlight may therefore oversimplify its role in navigation and planning. Perhaps the observed attentional modulation reflects a perceptual stage of building the trail in the maze rather than a filter for a later representation for more efficient decision making and planning. A fuller discussion of whether the current model and data can distinguish between these frameworks would benefit readers.

      (3) Lateralization of attention: The analysis considers whether relevant information is distributed bilaterally or unilaterally across the visual display, but does not sufficiently address evidence for attentional asymmetries across the left and right visual fields due to hemispheric specialization (e.g., Bartolomeo & Seidel Malkinson, 2019). Whether effects differ for left versus right hemifield arrangements is not made explicit in the presented findings.

      (4) Individual differences: Individual differences in attentional modulation are a strength of the work, but similar analyses exploring individual variation in lateralization effects could provide further insight, and the lack of such analyses may mask important effects.

      (5) Distinction between overt and covert attention: The current report at times equates eye movement patterns with the locus of attention. However, attention can be covertly shifted without corresponding gaze changes (see, for example, Pooresmaeili & Roelfsema, 2014).

      The implications for interpreting the relationship between eye movement, memory, and attention in this setting are not fully addressed. The potential dynamics of attention along a maze trajectory and their impact on lateralization analysis would benefit from further clarification.

      Appraisal of Aims and Results:

      The study sets out to determine how spatial attention shapes the construction of task representations in planning contexts. The authors provide evidence that spatial proximity and arrangement influence which environmental features are incorporated into internal models used for navigation, and that accounting for these effects improves model predictions. There is clear documentation of individual variation, with some participants showing greater attentional spillover and more sparse awareness profiles.

      However, some conceptual and methodological aspects would be clearer with greater engagement with the broader literature on attention dynamics, a more explicit justification of operational choices, and more targeted lateralization analyses.

    8. Reviewer #2 (Public review):

      Summary:

      Castanheira et al. investigate the role of spatial attention for planning during three maze navigation experiments (one new experiment and two existing datasets). Effective planning in complex situations requires the construction of simplified representations of the task at hand. The authors find that these mental representations (as assessed by conscious awareness) of a given stimulus are influenced by (spatially) surrounding stimuli. Individual participants varied in the degree to which attention influenced their task representations, and this attentional effect correlated with the sparsity of representations (as measured by the range of awareness reports across all stimuli). Spatially grouping task-relevant information on either the left or right side of the maze led to mental representations more similar to optimal representations predicted by the value-guided construal (VGC) model - a normative model describing a theoretical approach to simplifying complex task information. Finally, the authors propose an update to this model, incorporating an attentional spotlight component; the revised descriptive model predicts empirical task representations better than the original (normative) VGC model.

      Strengths:

      The novelty of this study lies in the proposal and investigation of a cognitive mechanism through which a normative model like value-guided construal can enable human planning. After proposing attention as this mechanism, the authors make concrete hypotheses about mismatches between the VGC predictions and real human behavior, which are experimentally validated. Thus, not only does this study describe a possible mechanism for simplification of task information for planning, but the authors also propose a descriptive model, revising VGC to incorporate this attentional component.

      A strength of this paper is the variety of investigative approaches: analysis of existing data, novel experiment, and a computational approach to predict experimental findings from a theoretical model. Analyzing pre-existing datasets increases the size of the participant cohort and strengthens the authors' conclusions. Meanwhile, comparing the predictions of the existing normative model and the authors' own refined model is a clever approach to substantiate their claims. In addition, the authors describe several crucial controls, which are key to the interpretability of their results. In particular, the eye tracking results were critical.

      In summary, this paper constitutes an important step toward a more complete understanding of the human ability to plan.

      Weaknesses:

      (1) There is a critical conceptual gap in the study and its interpretation, mainly due to the reliance on a self-report metric of awareness (rather than an objective measure of behavioral performance).

      a. Awareness is tested by a 9-point self-report scale. It is currently unclear why awareness of task-irrelevant obstacles in this task would necessarily compromise optimal planning. There is no indication of whether self-reported awareness affects performance (e.g., navigation path distance, time to complete the maze, number of errors). Such behavioral evidence of planning would be more compelling.

      b. Relatedly, it would have been more convincing to have an objective measure of awareness, for instance, how the presence or absence of a "task-irrelevant" obstacle affects performance (e.g., change navigation path distance or time to complete the maze), or whether participants can accurately recall the location of obstacles.

      c. Consequently, I'm not sure that we can conclude that the spatial context does impact participants' ability to plan spatial navigation or to "incorporate task-relevant information into their construal". We know that the spatial context affects subjective (self-reported) awareness, but the authors do not present evidence that spatial context affects behavioral performance.

      d. Another concern that may complicate interpretation is the following: Figure 3c shows improved VGC model predictions (steeper slope) for mazes with greater lateralization. However, there are notable outliers in these plots, where a high lateralization index does not correspond to good model performance. There is currently no discussion/explanation of these cases.

      (2) I noticed an issue with clarity regarding task-relevance. It is currently not fully clear which obstacles are "task irrelevant". Also, the term is used inconsistently, sometimes conflating with "awareness". For example, in the "Attentional spotlight model of task representations" section, the authors state that "task-relevant information becomes less relevant when surrounded by task-irrelevant information". But they really mean that participants become less aware of those task-relevant obstacles. I assume task-relevance is an objective characteristic related to maze organization, not to a participant's construal. Indeed, the following paragraph provides evidence of model predictions of awareness.

      (3) The behavioral paradigm has some distinct disadvantages, and the validity of the task is not backed up by behavioral data.

      a. I understand the need for central fixation, but it also makes the task less naturalistic.

      b. The task with its top-down grid view does not seem to mimic real human navigation. Though this grid may be similar to mental maps we form for navigation, the sensory stimuli corresponding to possible paths and to spatial context during real-life navigation are very different.

      c. Behavioral performance is not reported, so it is unknown whether participants are able to properly complete the task. The task seems pretty difficult to navigate, especially when the obstacles disappear, and in combination with the central fixation.

      d. There is no discussion of whether/how this navigation task generalizes to other forms of planning.

    9. Reviewer #3 (Public review):

      Summary:

      The authors build on a recent computational model of planning, the "value-guided construal" framework by Ho et al. (2022), which proposes that people plan by constructing simple models of a task, such as by attending to a subset of obstacles in a maze. They analyze both published experimental data and new experimental data from a task in which participants report attention to objects in mazes. The authors find that attention to objects is affected by spatial proximity to other objects (i.e., attentional overspill) as well as whether relevant objects are lateralized to the same hemifield. To account for these results, the authors propose a "spotlight-VGC" model, in which, after calculating attention scores based on the original VGC model, attention to objects is enhanced based on distance. They find that this model better explains participant responses when objects are lateralized to different hemifields. These results demonstrate complex interactions between filtering of task-relevant information and more classical signatures of attentional selection.

      Strengths:

      (1) The paper builds on existing modeling work in a novel manner and integrates classic results on attention into the computational framework.

      (2) The authors report new and extensive analyses of existing data that shed light on additional sources of systematic variability in responses related to attentional spillover effects.

      (3) They collect new data using new stimuli in the original paradigm that directly test predictions related to the lateralization of task-relevant information, including eye tracking data that allows them to control for possible confounds.

      (4) The extended model (spotlight-VGC) provides a formal account of these new results.

      Weaknesses:

      (1) The spotlight-VGC model has a free parameter - the "width" of the attentional spotlight. This seems to have been fixed to be 3 squares. It would be good if the authors could describe a more principled procedure for selecting the width so that others can use the model in other contexts.

      (2) Have the authors considered other ways in which factors such as attentional spillover and lateralization could be incorporated into the model? The spotlight-VGC model, as presented, involves first computing VGC predictions and only afterwards computing spillover. This seems psychologically implausible, since it supposes that the "optimal" representation is first formed and then it gets corrupted. Is there a way to integrate these biases directly into the VGC framework, perhaps as a prior on construals? The authors gesture towards this when they talk about "inductive biases", but this is not formalized.

      (3) Can the authors rule out that the lateralization effects are the result of memory biases since the main measure used is a self-report of attention?

    1. eLife Assessment

      This study offers a valuable analysis of how moment-to-moment fluctuations in arousal are associated with structured, non-uniform patterns of brain-wide functional connectivity during wakefulness. Using data-driven analyses of resting-state and naturalistic fMRI with eye tracking, the authors present convincing evidence that arousal is a dynamic, continuous process that shapes brain activity in a structured way beyond a simple global effect. This paper sheds light on the link between brain activity and ongoing fluctuations in arousal and will be of interest to researchers studying large-scale brain functional organization and links between the brain and body.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors aim to characterize how moment-to-moment fluctuations in arousal during wakefulness shape large-scale functional brain connectivity. Using pupil diameter as an index of arousal and high-field functional imaging, they seek to determine whether arousal-related modulation of connectivity is uniform across the brain or organized into structured patterns, and whether such patterns show hemispheric asymmetry. The work further aims to assess whether these organizational features generalize across resting-state and naturalistic viewing conditions.

      Strengths:

      The study addresses an important and timely question regarding how spontaneous variations in arousal influence whole-brain communication during wakefulness. The dataset is rich, combining high-field imaging with concurrent physiological measurements, and the analyses are ambitious in scope. A key strength is the attempt to move beyond region-based effects and to describe arousal-related modulation at the level of large-scale connectivity organization. The comparison across rest and movie viewing provides useful context and suggests a degree of consistency across behavioral states.

      Weaknesses:

      All analyses are based on 7T ultra-high-field imaging. The manuscript does not address whether the reported arousal-related patterns, including the community structure and hemispheric asymmetries, are expected to be reproducible at standard 3T field strengths. It therefore remains unclear whether the findings depend critically on the use of high-field data or whether they would generalize to more widely available datasets, limiting the broader applicability of the results.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript addresses a clear and widely relevant question: how ongoing fluctuations in alertness during wakefulness relate to large scale patterns of coordinated brain activity. The authors combine high field magnetic resonance imaging with simultaneous pupil measurements, and they compute an edgewise measure of arousal-related coupling for every pair of regions. Their main contribution is to show that arousal-related coupling is low dimensional and organized into seven reproducible "connectivity communities", each with characteristic network pair compositions. A secondary contribution is the observation that these communities exhibit systematic but community-specific hemispheric asymmetries, including a striking left/right dissociation within the ventral attention network, where the left side participates broadly across communities while the right side forms a more cohesive, segregated arousal responsive module. A final contribution is cross-context generalization: the same organizational structure and lateralization signatures are largely preserved during naturalistic movie watching.

      Strengths:

      (1) The paper moves beyond state contrasts and quantifies arousal related modulation continuously within wakefulness, directly addressing a gap highlighted in the Introduction.

      (2) The hemispheric asymmetry result is not framed as a crude global dominance effect; the authors explicitly test and argue that the key signal lies in structured spatial heterogeneity rather than mean shifts.

      (3) The cross-paradigm replication in movie watching is a strong design choice and supports the claim that the organizational motifs are not limited to unconstrained rest.

      (4) Arousal effects on BOLD signals and on pupil size can have different delays. The authors have now tested lagged relationships (for example shifting the pupil series forward and backward) to show that the main community structure and lateralization results are not sensitive to an arbitrary temporal alignment.

      (5) Time resolved connectivity results are now shown to be robust to changes in parameters.

    4. Reviewer #3 (Public review):

      Summary:

      The paper investigates neural fluctuations underlying arousal using a combination of resting state/naturalistic movie watching fMRI and eye tracking data. The authors have used several data driven approaches, including time varying sliding window analyses and clustering methods, to characterize large scale brain organization and hemispheric asymmetries associated with arousal fluctuations. This is an interesting study framing arousal as a dynamic, continuously varying process rather than a discrete state. Overall, the manuscript is well written and the authors have provided sufficient details about the methodological choices, their impact on the results, along with the limitations of the study.

      Strengths:

      This is an interesting study framing arousal as a dynamic, continuously varying process rather than a discrete state. Overall, the manuscript is well written and provides sufficient methodological and analytical details to evaluate the results.

      Weakness:

      While the study provides new insights regarding neural processes underlying arousal, future studies may be needed to further examine the implications of the identified clusters and patterns.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      (1) First, a central claim is that arousal modulates functional connectivity in a hemispherically asymmetric and community-specific manner. Although structured asymmetries are demonstrated at the group level, it remains unclear whether these effects reflect a stable neurobiological principle or arise from high-dimensional, connection-wise analyses that are sensitive to sampling variability. Given the interpretive weight placed on hemispheric lateralization, stronger evidence of robustness and individual-level consistency would be necessary to support this conclusion.

      We appreciate your critical comments on the robustness of our lateralization findings. We fully agree with you that it is essential to demonstrate that the observed hemispheric asymmetries reflect a stable neurobiological principle rather than an artifact of sampling variability or high-dimensional noise. To address this concern, we performed two rigorous validation analyses using 500-iteration resampling schemes, consisting of a split-half reliability test and a participant-level consistency assessment.

      First, to ensure our findings do not depend on specific sample compositions, we conducted a split-half reliability test where the dataset was randomly partitioned into two independent subgroups over 500 iterations. As shown in Figure S1A, the community labels maintained high spatial consistency across iterations (as evidenced by the confusion matrix and Dice coefficient distributions), and our original findings—including network-pair community architecture (Fig. S2A), regional affiliation patterns (Fig. S3A-B), and arousal–tvFC coupling lateralization (Fig. S4A-B)—were consistently situated at the center of the iteration distributions.
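      As an illustration of the split-half comparison described above, the following minimal Python sketch partitions a participant list at random and scores agreement between two edge labelings with a per-community Dice coefficient. It is not the authors' code: the function names (`split_half`, `dice_per_community`), the toy labels, and the assumption that community labels have already been matched across the two halves are all hypothetical.

```python
import random


def split_half(participants, rng):
    """Randomly partition participants into two disjoint, near-equal halves."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def dice_per_community(labels_a, labels_b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for each community label.

    Assumes both labelings cover the same edges in the same order and that
    labels were already aligned across the two solutions (hypothetical step).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    scores = {}
    for k in sorted(set(labels_a) | set(labels_b)):
        edges_a = {i for i, lab in enumerate(labels_a) if lab == k}
        edges_b = {i for i, lab in enumerate(labels_b) if lab == k}
        scores[k] = 2 * len(edges_a & edges_b) / (len(edges_a) + len(edges_b))
    return scores


rng = random.Random(0)
half_1, half_2 = split_half([f"sub-{i:03d}" for i in range(139)], rng)

# Toy edge labels standing in for the clustering result of each half.
labels_half_1 = [0, 0, 1, 1, 2, 2]
labels_half_2 = [0, 0, 1, 2, 2, 2]
scores = dice_per_community(labels_half_1, labels_half_2)
```

      In a full analysis, the split and the comparison would be repeated (500 times in the response above) to build a distribution of Dice values per community.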

      Second, to account for potential within-participant dependencies in the HCP 7T dataset, we performed a participant-level resampling analysis (N = 139). By randomly selecting a different session for each participant across 500 iterations, we confirmed that the community architecture and hemispheric biases remain robust even under this strict control (Figure S1A, S2B, S3C-D and S4C-D). Collectively, these additional analyses provide strong evidence that the hemispheric lateralization we reported is not a byproduct of sampling bias, but instead represents a stable organizational principle of the arousal-modulated connectome.
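      The participant-level resampling can be pictured as drawing exactly one run per participant on every iteration, so that no individual contributes more than one observation to a given clustering. The sketch below shows only that sampling step under stated assumptions: the run inventory and the helper name `one_run_per_participant` are hypothetical, and the clustering applied to each draw is left out.

```python
import random


def one_run_per_participant(runs_by_participant, rng):
    """Draw exactly one run for every participant."""
    return {pid: rng.choice(runs) for pid, runs in runs_by_participant.items()}


# Hypothetical run inventory; participants may contribute different numbers of runs.
runs_by_participant = {
    "sub-001": ["rest_1", "rest_2", "rest_3", "rest_4"],
    "sub-002": ["rest_1", "rest_3"],
    "sub-003": ["rest_2"],
}

rng = random.Random(0)  # fixed seed so the draws are reproducible
draws = [one_run_per_participant(runs_by_participant, rng) for _ in range(500)]
# Each draw would then be clustered and compared against the full-sample solution.
```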

      (2) Second, all analyses are based on ultra-high-field imaging. The manuscript does not address whether the reported arousal-related patterns, including the community structure and hemispheric asymmetries, are expected to be reproducible at standard field strengths. It therefore remains unclear whether the findings depend critically on the use of high-field data or whether they would generalize to more widely available datasets, limiting the broader applicability of the results.

      We appreciate your constructive comments on the generalizability of our findings across different field strengths.

      As you noted, our primary motivation for employing 7T ultra-high-field imaging was to leverage its superior signal-to-noise ratio (SNR) and significantly enhanced BOLD sensitivity. These technical advantages were instrumental in capturing the subtle, moment-to-moment coupling between spontaneous pupillary fluctuations and tvFC—signals that might be close to the detection threshold in standard field strength environments.

      However, we fully recognize your point that 3T remains the standard in most clinical and research settings. In the revised manuscript, we have added a dedicated discussion to address this (page 21, lines 447-456):

      “Fifth, the findings reported here were derived exclusively from ultra-high-field (7T) imaging data. The superior BOLD sensitivity of 7T fMRI was instrumental in resolving the fine-scale community architecture of arousal–tvFC coupling, which involves subtle signals that may be challenging to detect at lower field strengths. Given that 3T remains the most common parameter for neuroimaging research and clinical applications, future investigations are needed to determine the extent to which these organizational principles generalize to standard field strength data. Validating these motifs in large-scale 3T datasets will be essential to establish their broader applicability across different imaging environments.”

      (3) Third, arousal-connectivity coupling is assessed using zero-lag correlations between pupil diameter and time-resolved connectivity estimates. Physiological and hemodynamic considerations suggest that pupil-linked arousal and blood-based imaging signals may exhibit systematic temporal delays. The absence of analyses examining sensitivity to such delays raises the possibility that the reported coupling patterns depend on a specific temporal alignment assumption.

      Given the inherent delay of the hemodynamic response function (HRF) and the complex temporal relationship between pupillary dynamics and neural activity, we conducted an additional lagged cross-correlation analysis to test the sensitivity of our findings. Following established frameworks for linking BOLD signals with pupillometry (Yellin et al., 2015; Gonzalez-Castillo et al., 2022; Lloyd et al., 2023), we systematically shifted the pupil time series relative to the fMRI data by -3 TR to +3 TR (-3s to +3s) and evaluated the consistency of the community architecture across these different lags using Dice coefficients.

      As shown in Figure S5, these results demonstrate that the community organization remains stable across the tested range of physiological delays. This stability indicates that the arousal-modulated communities we reported are not specific to the zero-lag assumption but instead persist throughout the physiologically plausible lag window. Consequently, our findings reflect a robust neurobiological phenomenon rather than an artifact of a specific temporal alignment.
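      A lag sweep of this kind can be sketched for a single connection as follows. This is an illustrative example, not the authors' pipeline: it assumes one sample per TR, uses a synthetic connectivity trace that is simply the pupil trace delayed by two samples, and adopts the convention (an assumption here) that a positive lag means the pupil leads.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_coupling(pupil, fc, lag):
    """Correlate pupil[t] with fc[t + lag]; lag > 0 means the pupil leads."""
    if lag > 0:
        pupil, fc = pupil[:-lag], fc[lag:]
    elif lag < 0:
        pupil, fc = pupil[-lag:], fc[:lag]
    return pearson(pupil, fc)


# Synthetic demo: the "connectivity" trace is the pupil trace delayed by 2 samples.
pupil = [math.sin(0.3 * t) for t in range(200)]
fc = [math.sin(0.3 * (t - 2)) for t in range(200)]

# Sweep the same range as in the response above: -3 to +3 samples.
coupling_by_lag = {lag: lagged_coupling(pupil, fc, lag) for lag in range(-3, 4)}
best_lag = max(coupling_by_lag, key=coupling_by_lag.get)
```

      In the actual analysis the coupling map at each lag would be clustered and compared with the zero-lag solution, rather than inspecting one edge.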

      (4) Fourth, the estimation of time-resolved connectivity relies on a single choice of sliding-window length. The manuscript does not examine whether the reported patterns are stable across different window sizes. Given ongoing concerns about parameter dependence in time-resolved connectivity analyses, sensitivity analyses would be important to establish that the findings are not artifacts of a particular analytical choice.

      To ensure that our findings are not artifacts of a specific analytical choice, we performed an exhaustive sensitivity analysis by repeating our entire pipeline across a wide range of window lengths (30s, 35s, 60s, and 90s) and step sizes (1s, 5s, and 10s). We then employed Dice coefficients to quantify the topological similarity between these alternative configurations and our original parameters (30s window, 5s step).

      As shown in Figure S5, our results demonstrate high topological consistency, with Dice coefficients for community structures remaining consistently above 0.8 across all tested parameter combinations. These findings provide strong evidence that the arousal-modulated organizational principles we reported are inherent to the data rather than being driven by specific analytical choices in the sliding-window setup.
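      The parameter sweep above amounts to recomputing windowed correlations for every combination of window length and step size. A minimal sketch for one pair of regions, assuming one sample per second (so a 30 s window is 30 samples), might look like the following; the signals are synthetic and the function name `sliding_window_fc` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sliding_window_fc(x, y, window, step):
    """Correlation of x and y inside each window of `window` samples."""
    starts = range(0, len(x) - window + 1, step)
    return [pearson(x[s:s + window], y[s:s + window]) for s in starts]


n_samples = 900  # hypothetical run length at one sample per second
x = [math.sin(0.2 * t) for t in range(n_samples)]
y = [math.sin(0.2 * t) + 0.5 * math.cos(0.05 * t) for t in range(n_samples)]

# Window lengths and step sizes (in seconds) taken from the response above.
window_lengths = (30, 35, 60, 90)
step_sizes = (1, 5, 10)
tvfc = {
    (w, s): sliding_window_fc(x, y, w, s)
    for w in window_lengths
    for s in step_sizes
}
```

      Each entry of `tvfc` is one time-varying connectivity trace; the robustness check would repeat the downstream coupling and clustering steps on each and compare the resulting partitions.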

      (5) Finally, the identification of seven connectivity communities is a central result, yet the justification for this choice relies primarily on a single clustering quality measure. In practice, evaluation of clustering solutions typically draws on multiple complementary criteria, including measures of compactness and separation, approaches for selecting the number of clusters, and assessments of stability under resampling. Without such complementary evaluations, it is difficult to determine whether the reported community structure reflects a stable organizational feature or sensitivity to specific methodological decisions.

      We agree that relying on a single measure can be limiting, and in the revised manuscript, we have implemented a comprehensive multi-criteria evaluation to justify our selection of K=7. To ensure the robustness of the community partition, we expanded our analysis to include several complementary indices, such as the Davies-Bouldin Index, Calinski-Harabasz Score, and Silhouette Coefficient, alongside the original Within-Cluster Sum of Squares (WCSS), as detailed in Figure S7A.

      To further minimize subjective bias in "elbow" detection, we utilized the L-method (Salvador & Chan, 2004), which identifies the optimal K by minimizing the combined root-mean-square error (RMSE) of two linear regression segments. As illustrated in Figure S7B, the RMSE was minimized at K=7, providing a robust mathematical basis for our partition. Furthermore, we systematically visualized the community maps across a range of granularities from K=5 to 9 (Figure S7C). This stability analysis demonstrates that the fundamental topological features and the resulting hemispheric asymmetries are not transient artifacts of a specific K but are consistently preserved as the clustering granularity increases. These additional evaluations demonstrate that the seven-community structure reflects a stable organizational feature of arousal-modulated connectivity.
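      The two-segment idea behind the L-method can be sketched as follows: for every candidate knee, fit one straight line to the points on its left and another to the points on its right, then keep the knee with the smallest combined RMSE. This sketch weights each segment by its share of the points, which is one reading of Salvador & Chan (2004) and should be treated as an assumption. The evaluation curve is synthetic, built with a deliberate bend at K = 7, and is not the authors' WCSS data.

```python
import math


def line_rmse(xs, ys):
    """RMSE of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / n)


def l_method(ks, scores):
    """Return (knee_k, error): the last K of the best-fitting left segment."""
    n = len(ks)
    best_k, best_err = None, float("inf")
    for c in range(2, n - 1):  # both segments keep at least two points
        left = line_rmse(ks[:c], scores[:c])
        right = line_rmse(ks[c:], scores[c:])
        total = (c / n) * left + ((n - c) / n) * right
        if total < best_err:
            best_k, best_err = ks[c - 1], total
    return best_k, best_err


# Synthetic curve: steep linear drop up to K = 7, shallow decline afterwards.
ks = list(range(2, 16))
scores = [100 - 12 * (k - 2) if k <= 7 else 35 - (k - 8) for k in ks]

knee_k, knee_err = l_method(ks, scores)
```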

      Reviewer #2 (Public review):

      (1) Arousal effects on BOLD signals and on pupil size can have different delays, so it would be valuable to test lagged relationships (for example, shifting the pupil series forward and backward) to show that the main community structure and lateralization results are not sensitive to an arbitrary temporal alignment.

      We agree with you that accounting for the varying delays between BOLD signals and pupillary dynamics is essential for ensuring the robustness of our results. We conducted a comprehensive lagged cross-correlation analysis to address it. Following established frameworks for linking BOLD signals with pupillometry (Yellin et al., 2015; Gonzalez-Castillo et al., 2022; Lloyd et al., 2023), we systematically shifted the pupil time series relative to the fMRI data by -3 TR to +3 TR (-3s to +3s) and evaluated the consistency of the community architecture across these lags using Dice coefficients.

      As shown in Figure S5C, these results demonstrate that the core community organization remains stable across the tested range of physiological delays. This stability confirms that our findings are not sensitive to an arbitrary temporal alignment but instead reflect a robust neurobiological phenomenon that persists throughout the physiologically plausible lag window.

      (2) Pupil diameter covaries with blinks, eye closure, and other factors that can covary with head motion and physiological noise. The Methods include substantial quality control and denoising, including motion regression and scrubbing, plus exclusions for eye closure.

      We appreciate your attention to these potential confounding factors. While we implemented rigorous preprocessing, including regressing out confounds from the fMRI images, we agree that physiological noise and motion may have influenced the pupil signals.

      To address this, we conducted an additional control analysis where we included head motion (framewise displacement, FD) and the global signal (defined as the mean signal across all gray matter voxels) as covariates when calculating the arousal–tvFC coupling. We then re-evaluated the similarity between the resulting community architecture and our original findings. As shown in Figure S4, the community structure remained stable after controlling for these variables.
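      One way to picture this control analysis is as a partial correlation between the pupil trace and a connectivity trace, with framewise displacement and the global signal partialled out. The sketch below uses the standard recursive formula for partial correlation. It is an illustration under stated assumptions, not the authors' implementation: all four time series are simulated, and the variable names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, covariates):
    """Correlation of x and y after removing linear effects of the covariates."""
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[-1], covariates[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated data: a shared arousal component plus motion and global-signal confounds.
rng = random.Random(0)
n = 2000
shared = [rng.gauss(0, 1) for _ in range(n)]
fd = [rng.gauss(0, 1) for _ in range(n)]             # framewise displacement
global_signal = [rng.gauss(0, 1) for _ in range(n)]
pupil = [s + m + rng.gauss(0, 1) for s, m in zip(shared, fd)]
fc = [s + m + g + rng.gauss(0, 1) for s, m, g in zip(shared, fd, global_signal)]

raw = pearson(pupil, fc)
controlled = partial_corr(pupil, fc, [fd, global_signal])
```

      In this simulation the coupling that survives the control reflects only the shared component, which is the pattern the control analysis is designed to test for.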

      Regarding eye closure, we intentionally did not regress this out, as extensive literature demonstrates that eye closure is itself a reliable physiological proxy for arousal levels (Sommer & Golz, 2010; Chang et al., 2016; Gonzalez-Castillo et al., 2022); regressing it out would likely remove the very arousal-related coupling effects we aim to investigate.

      (3) The dataset is described in terms of runs retained (for example, 485 resting runs), and runs are treated as observations in clustering after z-scoring across runs. If multiple runs come from the same individuals, the manuscript would benefit from explicitly showing that results replicate at the participant level (for example, community structure stability within participant across runs, and participant-level summary statistics used for inference), rather than relying primarily on pooled run-level patterns.

      We fully agree with you that it is essential to demonstrate that the observed hemispheric asymmetries reflect a stable neurobiological principle rather than an artifact of sampling variability or high-dimensional noise. To address this concern, we performed two rigorous validation analyses using 500-iteration resampling schemes, consisting of a split-half reliability test and a participant-level consistency assessment.

      First, to ensure our findings do not depend on specific sample compositions, we conducted a split-half reliability test where the dataset was randomly partitioned into two independent subgroups over 500 iterations. As shown in Figure S1A, the community labels maintained high spatial consistency across iterations (as evidenced by the confusion matrix and Dice coefficient distributions), and our original findings—including network-pair community architecture (Fig. S2A), regional affiliation patterns (Fig. S3A-B), and arousal–tvFC coupling lateralization (Fig. S4A-B)—were consistently situated at the center of the iteration distributions.

      Second, to account for potential within-participant dependencies in the HCP 7T dataset, we performed a participant-level resampling analysis (N = 139). By randomly selecting a different session for each participant across 500 iterations, we confirmed that the community architecture and hemispheric biases remain robust even under this strict control (Figure S1A, S2B, S3C-D and S4C-D). Collectively, these additional analyses provide strong evidence that the hemispheric lateralization we reported is not a byproduct of sampling bias, but instead represents a stable organizational principle of the arousal-modulated connectome.

      (4) Time-resolved connectivity is estimated using a 30-second sliding window and 5 second step. It is reasonable to wonder whether the same conclusions hold with alternative estimators that do not rely on fixed windows. The Discussion acknowledges this limitation, but adding a small robustness analysis would make the paper more definitive.

      To ensure that our findings are not artifacts of a specific analytical choice, we performed an exhaustive sensitivity analysis by repeating our entire pipeline across a wide range of window lengths (30s, 35s, 60s, and 90s) and step sizes (1s, 5s, and 10s). We then employed Dice coefficients to quantify the topological similarity between these alternative configurations and our original parameters (30s window, 5s step).

      As shown in Figure S3, our results demonstrate high topological consistency, with Dice coefficients for community structures remaining consistently above 0.8 across all tested parameter combinations. Furthermore, the core hemispheric asymmetry patterns were robustly preserved regardless of the specific windowing configuration used. These results provide strong evidence that the arousal-modulated organizational principles we reported are inherent to the data and are stable across a broad range of temporal scales.

      Reviewer #3 (Public review):

      (1) A major limitation of the study is the limited discussion of subcortical regions, which play a central role in arousal regulation according to extensive prior literature. Although the current analyses focus primarily on cortical organization, the authors should include a brief discussion of how their findings relate to subcortical arousal systems.

      We completely agree that subcortical structures are pivotal drivers of arousal regulation. While our study primarily utilized a symmetric cortical atlas to ensure a mathematically rigorous assessment of hemispheric lateralization, we recognize that the exclusion of subcortical regions limits the functional interpretation of the observed patterns.

      In the revised manuscript, we have added a dedicated discussion part (page 20, lines 412-428) to address this point:

      “First, to ensure a mathematically rigorous assessment of hemispheric asymmetry, our analysis was restricted to a symmetric cortical parcellation. Consequently, while we demonstrate that arousal-modulated connectivity follows a structured macroscopic architecture, we did not explicitly analyze the subcortical nuclei hypothesized to drive these patterns. We hypothesize that the presence of these low-dimensional cortical communities reflects coordinated motifs rather than a homogeneous gain modulation, potentially mirroring the differentiated projection patterns of subcortical neuromodulatory systems. For instance, the locus coeruleus–noradrenergic pathway (Chandler et al., 2014; Schwarz & Luo, 2015) and thalamus (Hwang et al., 2017; Shine, 2019; Müller et al., 2020; Shine et al., 2023) possess extensive yet non-uniform projections that may anchor the community-specific and hemispherically asymmetric patterns observed here.”

      (2) While sliding window methods can capture temporal changes in functional organization, they have limitations in characterizing moment-to-moment neural fluctuations. In particular, results can be highly sensitive to window length and step size. The manuscript would benefit from (a) a clearer discussion of these methodological limitations, (b) justification for the chosen window length and step size, and (c) a sensitivity analysis demonstrating whether the main findings are robust across different parameter choices.

      To ensure that our findings are not artifacts of a specific analytical choice, we performed an exhaustive sensitivity analysis by repeating our entire pipeline across a wide range of window lengths (30s, 35s, 60s, and 90s) and step sizes (1s, 5s, and 10s). We then employed Dice coefficients to quantify the topological similarity between these alternative configurations and our original parameters (30s window, 5s step).

      As shown in Figure S5, our results demonstrate high topological consistency, with Dice coefficients for community structures remaining consistently above 0.8 across all tested parameter combinations. Furthermore, the core hemispheric asymmetry patterns were robustly preserved regardless of the specific windowing configuration used. These results provide strong evidence that the arousal-modulated organizational principles we reported are inherent to the data and are stable across a broad range of temporal scales.

      (3) The authors use k-means clustering to identify groups of brain regions and refer to these groupings as "communities." However, in general, community detection typically refers to graph-based algorithms that identify modules based on connectivity structure (e.g., modularity maximization). The clusters derived from k-means in feature space are not necessarily equivalent to graph-theoretic communities. The authors should explicitly clarify this distinction and adjust terminology accordingly to avoid conceptual ambiguity.

      We agree that the term "community detection" is often specifically associated with graph-based algorithms, such as modularity maximization, which define modules based on topological connectivity. In contrast, our implementation of k-means identifies groupings based on the similarity of arousal–FC coupling patterns within a high-dimensional feature space.

      To avoid any conceptual ambiguity or potential confusion, we have explicitly clarified this distinction in the Methods (pages 24-25, lines 533-542) section of the revised manuscript:

      “We employed the k-means clustering algorithm (Euclidean distance) to explore a range of cluster solutions from K = 2 to 15. To ensure the stability of the results and avoid local optima, each K was repeated 250 times with random initializations. The optimal number of clusters was determined by evaluating clustering quality and reproducibility (e.g., maximizing silhouette stability). It is important to clarify that "communities" in this context refer to clusters of edges that exhibit similar arousal-modulation motifs within a high-dimensional feature space, rather than topological modules typically derived from graph-theoretic algorithms like modularity maximization. This procedure consistently identified seven distinct communities, each representing a robust, arousal-sensitive connectivity motif that characterizes the large-scale organization of brain-pupil coupling.”
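      To make the repeated-initialization procedure concrete, here is a small pure-Python sketch of k-means with Euclidean distance that is restarted from many random initializations and keeps the solution with the lowest within-cluster sum of squares (WCSS). It is a toy illustration, not the authors' code: the points stand in for per-edge feature vectors, the three synthetic blobs and all function names are hypothetical, and a production analysis would more likely rely on an established library.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random initialization; returns (labels, wcss)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def kmeans_best_of(points, k, n_init, rng):
    """Repeat k-means n_init times and keep the lowest-WCSS solution."""
    runs = (kmeans_once(points, k, rng) for _ in range(n_init))
    return min(runs, key=lambda run: run[1])


# Three well-separated synthetic blobs of five points each.
offsets = [(0, 0), (0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]
blob_centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

best_labels, best_wcss = kmeans_best_of(points, k=3, n_init=250, rng=random.Random(0))
```

      Keeping the lowest-WCSS run guards against a single unlucky initialization settling in a poor local optimum, which is the stated purpose of the 250 repetitions.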

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) To strengthen confidence in the reported hemispheric effects, the authors should provide additional robustness analyses, such as subject-level consistency of lateralization measures, split-half or resampling reliability, and sensitivity to alternative preprocessing or analysis choices. Reporting the distribution of lateralization effects across individuals would help clarify whether the observed asymmetries reflect stable features or group-level averages driven by a subset of connections or participants.

      We agree that establishing the individual-level stability of lateralization is essential. We have now provided extensive validation, including split-half reliability tests and participant-level consistency analyses (500 iterations). These results confirm that the reported asymmetries are robust and consistent across the sample. Please refer to Reviewer #1 Weakness1 for the full analysis and associated figures (Figures S1-S4).

      (2) The authors should examine whether arousal-connectivity coupling patterns are robust to plausible temporal delays between pupil diameter and BOLD signals. Lagged or time-shifted analyses would help establish that the findings do not depend on a specific zero-lag assumption.

      We agree that validating the coupling between pupil dynamics and time-varying FC is essential. To address this, we conducted a lag sensitivity analysis by shifting the pupil-derived arousal signal within a physiologically plausible range (-3 to +3 TR). The community architecture remains highly consistent across these temporal offsets, showing high spatial correlation and Dice coefficients with our original findings. This stability confirms that the identified organizational motifs are robust and not dependent on a specific zero-lag assumption. For the full details of this validation and the associated figures, please refer to Reviewer #1 Weakness3 and Figure S5 in the Supplementary Material.

      (3) Given reliance on a single sliding-window length, the authors should assess how key results vary across different window sizes. Demonstrating stability of the community structure and lateralization patterns across parameter choices would strengthen the methodological foundation of the study.

      We have conducted an exhaustive sensitivity analysis across various window lengths (30s, 35s, 60s, 90s) and step sizes (1s, 5s, 10s). The high Dice coefficients (>0.8) confirm that our findings are not dependent on specific windowing choices. Please refer to Reviewer #1 Weakness4 and Figure S5 for the full results.

      (4) The justification for the chosen number of connectivity communities would benefit from additional clustering evaluations. Complementary criteria such as measures of compactness and separation, model selection approaches for determining the number of clusters, and stability or reproducibility under resampling would help establish whether the reported community structure is robust rather than method-dependent.

      To strengthen the mathematical basis for our partition, we have implemented a multi-metric evaluation and the L-method for objective K selection. These metrics consistently support the seven-community structure. Please refer to our response to Reviewer #1 Weakness5 and Figure S7 for the comprehensive evaluation.

      (5) The manuscript would benefit from a clearer discussion of why ultra-high-field imaging was required for the present analyses and whether similar results are expected at standard field strengths. If feasible, validation using lower-field data or reference to existing datasets would substantially enhance generalizability.

      We have expanded our discussion to clarify that 7T was instrumental for capturing the subtle, high-frequency arousal-tvFC coupling due to its superior SNR. We also explicitly discuss the potential and limitations of generalizing these findings to 3T datasets. Please refer to our response to Reviewer #1 Weakness2 for the full discussion (page 21, lines 447-456).

      (6) The authors should more explicitly report exclusion related to pupil measurements and discuss how missing or noisy pupillometry may affect the applicability of the approach in other datasets or experimental settings.

      We agree that transparency in data screening is essential for the reproducibility of our method. In the revised manuscript, we have clarified our quality control pipeline in the quality control section in Methods (page 23, lines 502-510):

      “The final analyzed sample for the resting-state consisted of N = 139 healthy participants (mean age = 29.1±3.5 years, 77 female). Runs were excluded if (a) more than 20% of frames exceeded motion thresholds, (b) eye tracking did not cover the full fMRI time series, or (c) more than 90% of samples were classified as eye closure. After applying these criteria, 485 of the initial 723 scans were retained for analysis. The same quality-control pipeline was applied to the movie-watching dataset, yielding 513 usable scans out of the original 725. Detailed information on data retention and run distribution per participant is summarized in Figure S9.”
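      The three exclusion rules quoted above translate directly into a per-run filter. The sketch below encodes them under stated assumptions: the `Run` record, its field names, and the example runs are hypothetical, and how the per-run motion and eye-closure fractions are computed is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    frac_high_motion: float      # fraction of frames above the motion threshold
    eye_tracking_complete: bool  # eye tracking covers the full fMRI time series
    frac_eye_closed: float       # fraction of samples classified as eye closure


def keep_run(run, max_motion=0.20, max_eye_closed=0.90):
    """Apply exclusion criteria (a)-(c); a run is kept only if it passes all three."""
    return (
        run.frac_high_motion <= max_motion
        and run.eye_tracking_complete
        and run.frac_eye_closed <= max_eye_closed
    )


# Hypothetical runs, one failing each criterion and one passing all of them.
runs = [
    Run("run-a", 0.05, True, 0.10),
    Run("run-b", 0.35, True, 0.10),   # too much motion
    Run("run-c", 0.05, False, 0.10),  # incomplete eye tracking
    Run("run-d", 0.05, True, 0.95),   # eyes closed almost throughout
]
retained = [run.run_id for run in runs if keep_run(run)]
```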

      Furthermore, we have added a discussion regarding how noisy or missing pupillary signals might affect the generalizability of our approach (pages 20-21, lines 437-447):

      “Fourth, the generalizability of our approach to external cohorts warrants caution regarding pupillary data integrity. In contexts where high-fidelity eye-tracking is technically demanding—such as in clinical settings involving patients with restricted compliance or in naturalistic fMRI studies—the prevalence of blink artifacts and signal dropouts may bias the estimation of arousal-modulated states. Excessive reliance on data interpolation in such cases could artificially smooth temporal fluctuations, leading to an overestimation of community stability. Future applications should therefore prioritize high-frequency sampling and potentially incorporate multi-modal physiological features (e.g., respiratory or cardiac signals) to cross-validate arousal dynamics when pupillary data is suboptimal (Meissner et al., 2023; Bolt et al., 2025; Weijs et al., 2025).”

      (7) The authors should ensure that all data and analysis code necessary to reproduce the results are made publicly available in accordance with eLife policies, including clear documentation of preprocessing steps, parameter choices, and clustering procedures.

      All analysis code and the necessary processed data required to reproduce our findings have been made publicly available through https://github.com/kongxy6478/Arousal-modulates-functional-connectivity. This repository includes documented pipelines for pupillometry cleaning and fMRI denoising, alongside the core Python scripts used for sliding-window connectivity calculation, k-means clustering, and hemispheric lateralization analysis.

      Reviewer #2 (Recommendations for the authors):

      (1) Add a lag sensitivity analysis between pupil-derived arousal and time-resolved connectivity, and report whether the seven community structure and key lateralization findings are stable across a plausible lag range.

      We agree that validating the coupling between pupil dynamics and time-varying FC is essential. To address this, we conducted a lag sensitivity analysis by shifting the pupil-derived arousal signal within a physiologically plausible range (-3 to +3 TR). The community architecture remains highly consistent across these temporal offsets, showing high spatial correlation and Dice coefficients with our original findings. This stability confirms that the identified organizational motifs are robust and not dependent on a specific zero-lag assumption. For the full details of this validation and the associated figures, please refer to Reviewer #1 Weakness3 and Figure S5 in the Supplementary Material.

      (2) Quantify and report the extent to which residual head motion, blink rate, eye closure segments, and global signal changes explain arousal connectivity coupling, for example, via partial correlation or regression controls, and show that key effects persist.

      We agree that it is essential to demonstrate that the observed arousal-connectivity coupling is not driven by non-specific physiological or motion-related artifacts. As requested, we have quantified the influence of head motion (FD) and global signal on our primary results. By implementing partial correlation analyses, we confirmed that the identified arousal-modulated community structures persist even after strictly controlling for these variables. These results indicate that the arousal-tvFC coupling we report reflects a specific neuro-arousal process rather than a byproduct of motion or systemic physiological fluctuations. For the detailed quantitative results and control analysis figures, please refer to our response to Reviewer #2 Weakness2 and Figure S6 in the Supplementary Material.

      (3) Add participant-level validation: demonstrate that community profiles and lateralization signatures are consistent within participants across runs, and consider participant-level statistical summaries rather than treating all runs as independent observations.

      We agree that demonstrating participant-level consistency is vital. In response, we performed two rigorous 500-iteration resampling schemes: a split-half reliability test and a participant-level consistency assessment (N = 139). These analyses, which involved randomly partitioning the sample and selecting single sessions per participant, confirm that our community architecture and hemispheric biases are remarkably stable and not driven by sampling variability or high-dimensional noise. For a comprehensive description of these validations and the associated statistical distributions, please refer to our detailed response to Reviewer #2 Weakness3 and Figures S1–S4.
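      The two resampling schemes can be sketched roughly as follows. This is a hedged illustration under stated assumptions: the participant identifiers, the number of runs per participant, and the helper names `split_half` and `one_run_per_participant` are invented for the example, and the downstream community detection applied to each resample is omitted.

```python
import random


def split_half(participant_ids, rng):
    """Randomly partition participants into two disjoint halves."""
    ids = list(participant_ids)
    rng.shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]


def one_run_per_participant(runs_by_participant, rng):
    """Pick a single run per participant so that observations are independent."""
    return {pid: rng.choice(runs) for pid, runs in runs_by_participant.items()}


rng = random.Random(0)

# Hypothetical data layout: 139 participants with 1 to 4 retained runs each.
runs_by_participant = {
    f"sub-{i:03d}": [f"run-{j}" for j in range(1, (i % 4) + 2)]
    for i in range(139)
}

# 500 iterations of each scheme; the analysis itself would run inside these loops.
splits = [split_half(runs_by_participant, rng) for _ in range(500)]
samples = [one_run_per_participant(runs_by_participant, rng) for _ in range(500)]
```

      Each element of `splits` yields two independent groups whose community solutions can be compared, and each element of `samples` yields a dataset with exactly one run per participant.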

      (4) Provide an alternative dynamic connectivity estimator robustness check, or at a minimum, vary the window length and step size to show stability of the primary conclusions.

      We have conducted an exhaustive sensitivity analysis across various window lengths (30s, 35s, 60s, 90s) and step sizes (1s, 5s, 10s). The high Dice coefficients (>0.8) confirm that our findings are not dependent on specific windowing choices. Please refer to Reviewer #1 Weakness3 and Figure S5 for the full results.
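      For readers less familiar with windowed connectivity, the role of window length and step size can be illustrated with a minimal sliding-window correlation. This sketch assumes a simple Pearson-based estimator with the window and step expressed in samples rather than seconds; it is not claimed to be the estimator used in the manuscript, and the toy signals are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sliding_window_corr(x, y, window, step):
    """Correlation of x and y within each window of `window` samples, advanced by `step`."""
    starts = range(0, len(x) - window + 1, step)
    return [pearson(x[s:s + window], y[s:s + window]) for s in starts]


# Toy signals: y is a linear function of x, so every window correlates perfectly.
x = [math.sin(0.3 * t) for t in range(100)]
y = [2.0 * v + 1.0 for v in x]
tv_corr = sliding_window_corr(x, y, window=30, step=5)
```

      Varying `window` and `step` changes both the number of windows and the temporal smoothness of the estimate, which is why stability across these settings is a meaningful robustness check.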

      (5) Consider validating the seven community solutions with at least one additional unsupervised approach, and report agreement with the main k-means solution.

      We agree that validating the clustering scheme is essential. To this end, we implemented a multi-criteria evaluation (including Davies-Bouldin and Silhouette indices) and utilized the L-method (Salvador & Chan, 2004) to mathematically confirm K=7 as the optimal granularity (Figure S7A–B). Furthermore, we verified that the core topological features and hemispheric asymmetries remain robustly consistent across a range of granularities from K=5 to 9 (Figure S7C). These analyses demonstrate that our findings are not dependent on a specific K or subjective bias. For the full quantitative evaluation and stability maps, please refer to our response to Reviewer #2 Weakness5 and Figure S7.

      (6) State explicitly, early in Results, what the main inferential unit is (run or participant) for each key analysis, and clarify how repeated runs per participant are handled.

      We agree that defining the inferential unit is critical for methodological clarity. In the revised manuscript, we have explicitly stated at the beginning of the Results section (page 5, lines 113-116):

      “While our primary inferential analyses were conducted at the run level to leverage the high-density sampling of the HCP 7T dataset, we further validated the robustness of these findings using participant-level statistical summaries and resampling to account for within-participant dependencies (see Figures S1–S2 in the Supplementary Material).”

      Specifically, all key findings—including community architecture and hemispheric asymmetries—were validated using participant-level statistics and resampling schemes (N = 139) to ensure that the results are not biased by within-participant dependencies.

      (7) When introducing the integration and segregation indices, add a brief intuitive explanation of what a positive or negative value means in plain language before the equations.

      We thank the reviewer for this suggestion to improve the accessibility of our methods. We have added brief, intuitive explanations for both indices in the Methods section (pages 26-27, lines 569-582):

      “The integration index provides a measure of the overall hemispheric dominance of arousal-modulated connections. A positive value indicates that arousal-related edges are preferentially concentrated in the left hemisphere (including its internal and outgoing connections) compared to the right.” and “The segregation index assesses whether arousal preferentially modulates local, intra-hemispheric communication versus long-range, inter-hemispheric communication. A positive value reflects a "segregated" left-hemisphere bias, where arousal strengthens within-hemisphere connections more than it strengthens across-hemisphere communication for that same hemisphere.”

      (8) In the Discussion, separate claims into "what we show" versus "what we hypothesize," especially when connecting findings to neuromodulatory pathways.

      In the revised manuscript, we have carefully separated our direct empirical findings from our mechanistic hypotheses. We have also used more cautious and speculative language (e.g., "suggesting a potential role of," "may be mediated by," and "we hypothesize that") (page 17, lines 352-358):

      “Specifically, we show the presence of low-dimensional, reproducible communities suggests that arousal modulates the connectome through coordinated motifs rather than homogeneous gain modulation. We hypothesize that this structured macroscopic architecture reflects the differentiated projection patterns of subcortical neuromodulatory systems, such as the locus coeruleus–noradrenergic pathway (Aston-Jones & Cohen, 2005; Jordan, 2024) and thalamus (Magnin et al., 2010; Lewis et al., 2015; Liu et al., 2018)”

      (9) Provide a clear participant-level summary (number of participants contributing to the retained runs, demographics if available, and distribution of runs per participant), alongside the reported run counts retained after quality control.

      We agree that clear reporting of participant-level data is essential. In the revised Methods section, we have added a detailed summary of participant demographics (age and sex) and clarified the sample composition (page 23, lines 502-503):

      “The final analyzed sample for the resting-state consisted of N = 139 healthy participants (mean age = 29.1±3.5 years, 77 female).”

      Furthermore, to provide a transparent view of the data retained after quality control, we have included Figure S9 to illustrate the distribution of valid runs per participant. This visualization confirms the amount of data contributing to our group-level inferences and accounts for exclusions due to motion or pupillary signal quality.

      (10) Report the robustness of results to reasonable changes in pupil preprocessing choices (for example, smoothing parameters or interpolation rules), since pupil diameter is the key arousal index.

      We agree that the robustness of pupil-derived arousal estimates is fundamental to our findings. To address this, we conducted an extensive validation analysis by comparing our original pupil preprocessing pipeline against 18 alternative combinations of parameters. These variations included different smoothing window sizes (100 ms, 200 ms, and 500 ms), interpolation methods (linear vs. cubic spline), and blink buffer durations (25 ms, 50 ms, and 100 ms). As shown in Figure S8, the pupil diameter time courses derived from these diverse pipelines remained highly correlated with our original estimates (all correlations above 0.65). This demonstrates that our arousal-modulated connectivity results are remarkably robust to reasonable changes in pupil preprocessing choices.

      Reviewer #3 (Recommendations for the authors):

      I have two additional minor comments:

      (1) Given the overall goal of this study to identify large-scale brain communities or clusters underlying arousal, the results may be sensitive to the choice of cortical parcellation. The authors should consider:

      (a) including analyses using additional parcellation schemes, or

      (b) discussing how the current findings might depend on the chosen parcellation and the implications for robustness and generalizability.

      We have addressed this by adding a dedicated point in the Discussion (page 21, lines 456-465):

      “Sixth, our findings were derived using a single high-resolution cortical parcellation. While the specific choice of atlas can influence fine-grained regional connectivity, it is important to note that our primary conclusions—such as hemispheric asymmetries and community-level preferences—were identified and interpreted at the macroscopic network and system level. By aggregating signals across broad functional systems, this approach likely mitigates the dependency on precise regional boundary definitions. Nevertheless, future studies employing alternative parcellation schemes would be valuable to further confirm that these organizational principles are not specific to the current atlas but represent a generalizable feature of the arousal-modulated connectome.”

      (2) Some key details, such as the number of participants included in the study, as well as basic demographic information, are not reported.

      We apologize for this omission. In the revised Methods section, we have now included a detailed summary of the participant demographics, including the final sample size (N = 139), age, and sex distribution (page 23, lines 502-503):

      “The final analyzed sample for the resting-state consisted of N = 139 healthy participants (mean age = 29.1±3.5 years, 77 female)”

      Furthermore, to ensure full transparency regarding data retention, we have added a new figure (Figure S9) illustrating the distribution of valid fMRI runs per participant following our quality-control procedures. We believe these additions provide a clear and complete overview of the study sample.

      References

      Aston-Jones, G., & Cohen, J. D. (2005). An integrative theory of locus coeruleus-norepinephrine function: Adaptive gain and optimal performance. Annual Review of Neuroscience, 28, 403–450. https://doi.org/10.1146/annurev.neuro.28.061604.135709

      Bolt, T., Wang, S., Nomi, J. S., Setton, R., Gold, B. P., deB.Frederick, B., Yeo, B. T. T., Chen, J. J., Picchioni, D., Duyn, J. H., Spreng, R. N., Keilholz, S. D., Uddin, L. Q., & Chang, C. (2025). Autonomic physiological coupling of the global fMRI signal. Nature Neuroscience, 28(6), 1327–1335. https://doi.org/10.1038/s41593-025-01945-y

      Chandler, D. J., Gao, W.-J., & Waterhouse, B. D. (2014). Heterogeneous organization of the locus coeruleus projections to prefrontal and motor cortices. Proceedings of the National Academy of Sciences, 111(18), 6816–6821. https://doi.org/10.1073/pnas.1320827111

      Chang, C., Leopold, D. A., Schölvinck, M. L., Mandelkow, H., Picchioni, D., Liu, X., Ye, F. Q., Turchi, J. N., & Duyn, J. H. (2016). Tracking brain arousal fluctuations with fMRI. Proceedings of the National Academy of Sciences, 113(16), 4518–4523. https://doi.org/10/f8ktgg

      Gonzalez-Castillo, J., Fernandez, I. S., Handwerker, D. A., & Bandettini, P. A. (2022). Ultra-slow fMRI fluctuations in the fourth ventricle as a marker of drowsiness. NeuroImage, 259, 119424. https://doi.org/10.1016/j.neuroimage.2022.119424

      Hwang, K., Bertolero, M. A., Liu, W. B., & D’Esposito, M. (2017). The Human Thalamus Is an Integrative Hub for Functional Brain Networks. The Journal of Neuroscience, 37(23), 5594–5607. https://doi.org/10.1523/JNEUROSCI.0067-17.2017

      Jordan, R. (2024). The locus coeruleus as a global model failure system. Trends in Neurosciences, 47(2), 92–105. https://doi.org/10.1016/j.tins.2023.11.006

      Lewis, L. D., Voigts, J., Flores, F. J., Schmitt, L. I., Wilson, M. A., Halassa, M. M., & Brown, E. N. (2015). Thalamic reticular nucleus induces fast and local modulation of arousal state. eLife, 4, e08760. https://doi.org/10.7554/eLife.08760

      Liu, X., De Zwart, J. A., Schölvinck, M. L., Chang, C., Ye, F. Q., Leopold, D. A., & Duyn, J. H. (2018). Subcortical evidence for a contribution of arousal to fMRI studies of brain activity. Nature Communications, 9(1), 395. https://doi.org/10.1038/s41467-017-02815-3

      Lloyd, B., De Voogd, L. D., Mäki-Marttunen, V., & Nieuwenhuis, S. (2023). Pupil size reflects activation of subcortical ascending arousal system nuclei during rest. eLife, 12, e84822. https://doi.org/10.7554/eLife.84822

      Magnin, M., Rey, M., Bastuji, H., Guillemant, P., Mauguière, F., & Garcia-Larrea, L. (2010). Thalamic deactivation at sleep onset precedes that of the cerebral cortex in humans. Proceedings of the National Academy of Sciences, 107(8), 3829–3833. https://doi.org/10.1073/pnas.0909710107

      Meissner, S. N., Bächinger, M., Kikkert, S., Imhof, J., Missura, S., Carro Dominguez, M., & Wenderoth, N. (2023). Self-regulating arousal via pupil-based biofeedback. Nature Human Behaviour, 8(1), 43–62. https://doi.org/10.1038/s41562-023-01729-z

      Müller, E. J., Munn, B., Hearne, L. J., Smith, J. B., Fulcher, B., Arnatkevičiūtė, A., Lurie, D. J., Cocchi, L., & Shine, J. M. (2020). Core and matrix thalamic sub-populations relate to spatio-temporal cortical connectivity gradients. NeuroImage, 222, 117224. https://doi.org/10.1016/j.neuroimage.2020.117224

      Salvador, S., & Chan, P. (2004). Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. 16th IEEE International Conference on Tools with Artificial Intelligence, 576–584. https://doi.org/10.1109/ICTAI.2004.50

      Schwarz, L. A., & Luo, L. (2015). Organization of the Locus Coeruleus-Norepinephrine System. Current Biology, 25(21), R1051–R1056. https://doi.org/10.1016/j.cub.2015.09.039

      Shine, J. M. (2019). Neuromodulatory Influences on Integration and Segregation in the Brain. Trends in Cognitive Sciences, 23(7), 572–583. https://doi.org/10.1016/j.tics.2019.04.002

      Shine, J. M., Lewis, L. D., Garrett, D. D., & Hwang, K. (2023). The impact of the human thalamus on brain-wide information processing. Nature Reviews Neuroscience, 24(7), 416–430. https://doi.org/10.1038/s41583-023-00701-0

      Sommer, D., & Golz, M. (2010). Evaluation of PERCLOS based current fatigue monitoring technologies. 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, 4456–4459. https://doi.org/10.1109/IEMBS.2010.5625960

      Weijs, M. L., Missura, S., Potok-Szybińska, W., Bächinger, M., Badii, B., Carro-Domínguez, M., Wenderoth, N., & Meissner, S. N. (2025). Modulating cortical excitability and cortical arousal by pupil self-regulation. Nature Communications, 16(1), 4552. https://doi.org/10.1038/s41467-025-59837-5

      Yellin, D., Berkovich-Ohana, A., & Malach, R. (2015). Coupling between pupil fluctuations and resting-state fMRI uncovers a slow build-up of antagonistic responses in the human cortex. NeuroImage, 106, 414–427. https://doi.org/10.1016/j.neuroimage.2014.11.034

    1. eLife Assessment

      The importance of uterine natural killer (NK) cells in reproductive success has been demonstrated in mice and humans; however, it is still unclear how uterine NK cells develop. In this important manuscript, the authors provide convincing evidence that TGF-β signaling in NK cells supports normal pregnancy in mice by the conversion of conventional NK cells into uterine tissue-resident NK cells. Previous concerns have been addressed in this revised version.

    2. Reviewer #1 (Public review):

      This is an excellent paper from Dr. Yokoyama and colleagues. The experiments are technically demanding, given the very low cell numbers and the challenges of working with implantation sites at gestational days 6.5, 10.5, and 14.5. Overall, the impact of TGF-β receptor II deficiency in the NK lineage on uterine trNK cell numbers and litter size is convincing, and the authors' conclusions are well supported by the data. Less convincing, however, is the claim that the decrease in trNK cells is compensated by an increase in cNK cells; rather, the absence of TGF-β receptor II appears to result in an overall reduction of NK/ILC1 cells.

      Comments on revised version:

      I thank the authors for addressing all my comments from my initial review.

    3. Reviewer #2 (Public review):

      In their manuscript "TGF-β drives the conversion of conventional NK cells into uterine tissue-resident NK cells to support murine pregnancy", Yokoyama and colleagues investigate the role of Tgfbr2 expression by NK cells in the formation of tissue-resident uterine NK cells and subsequent importance in murine pregnancy. By transferring congenic splenic conventional NK cells into pregnant mice, they show conversion of circulating NK cells into uterine ivCD45 negative tissue-resident NK cells. When interfering with the formation of uterine trNK cells, spiral artery remodelling was impaired, fetal resorption rates were increased, and litter sizes were reduced.

      Generally, this is a research topic of high interest, yet the manuscript is lacking detailed mechanistic insights and some questions remain open. At the current state, the data represent an interesting characterisation of the Tgfbr2-fl/fl Ncr1-Cre mice in pregnancy, but considering 1) the recent publication by the group (Ref#17) on the role of Eomes+ cNK cells during pregnancy, 2) the previously described role of Tgfbr2 and autocrine TGF-β expression for uterine NK cell differentiation in virgin mice (also cited by the authors), and 3) the well-known relevance of uterine NK cells during pregnancy, additional experiments addressing the specific role of TGF-β during pregnancy would help to improve novelty and significance of the manuscript.

      Comments on revised version:

      In their revised version of the manuscript and their point-by-point response, the authors have very carefully addressed and discussed all of our concerns and suggestions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) Figure 1A and B: Although a trend is evident, it does not appear that the absolute number of cNK cells at day 14 is significantly changed from day 6.5?

      We thank the reviewer for this careful observation. We had not originally performed a statistical comparison between the number of cNK cells present at gds 6.5 and 14.5. We have now conducted the appropriate statistical analysis for this dataset and found that the absolute number of cNK cells at day 14.5 is in fact significantly different from day 6.5 (p = 0.0005; unpaired t test, Mann-Whitney correction). The figure and corresponding legend have been updated to reflect this analysis. Please see Figure 1B:

      “Statistics were calculated using unpaired t tests with the Mann-Whitney correction. Error bars indicate SEM; *** p < 0.001.”

      (2) Figure 2E: The authors state, "This reduction of uterine trNK cells was accompanied by a concomitant increase in the absolute number and frequency of CD49b+Eomes+ cNK cells within the pregnant uterus of TGF-βRIINcr1Δ dams (Figure 2 D, E)." The number of cNK cells appears relatively low (visually ~1,000-1,300), and although the difference is statistically significant, its physiological relevance is unclear. More importantly, this modest increase does not correlate with the marked decrease in trNK and ILC1 populations, as cNK cells do not appear to accumulate. In my opinion, the conclusion "Collectively, these findings indicate that a TGF-β-driven differentiation pathway directs the conversion of peripheral cNK cells into uterine trNK cells during murine pregnancy" should be slightly toned down.

      We thank both reviewers for this suggestion. Regarding the absence of cNK cell accumulation in the absence of TGF-β signaling, we suggest that this may be related to the normal passage of cNK cells circulating in the placenta, i.e., these cells may not have acquired signals to remain in the uterus and are simply continuing to pass through without accumulating. Nonetheless, we have rephrased our wording to address this concern as follows:

      “This reduction of uterine trNK cells was accompanied by a small increase in the absolute number and frequency of CD49b<sup>+</sup> Eomes<sup>+</sup> cNK cells within the pregnant uterus of TGF-βRII<sup>Ncr1∆</sup> dams (Figure 2 D, E). Collectively, these findings suggest that a TGF-β–driven differentiation pathway directs the conversion of peripheral cNK cells into uterine trNK cells during murine pregnancy.”

      “The absence of cNK cell accumulation in the gravid uterus in the setting of impaired TGF-β signaling suggests a defect in tissue retention rather than recruitment. In the absence of TGF-β–mediated cues, circulating cNK cells that enter the uterine vasculature may fail to acquire the molecular programs required for residency and instead continue to transit through the tissue. This is consistent with a model in which TGF-β signaling promotes not only phenotypic conversion but also the acquisition of retention signals necessary for persistence within the uterine microenvironment, reinforcing that acquisition of tissue-residency in the gravid uterus is an actively instructed process [29,32].”

      (3) Figures 2-4: It is unclear whether the littermate controls are floxed mice or floxhet-Ncr1iCre mice? This distinction is important, as Ncr1iCre expression itself could potentially lead to a phenotype.

      To address these concerns, we characterized the uterine innate lymphoid cell compartment in the pregnant uterus of Ncr1<sup>iCre</sup> dams at gestational day 6.5. We did not observe a difference in the absolute number and frequency of trNK cells, cNK cells, and ILC1s in the gravid uterus of Ncr1<sup>iCre</sup> dams compared to wildtype CD45.1 C57BL/6 mice. Additionally, the number of implantation sites and the resorption rates in Ncr1<sup>iCre</sup> dams were comparable to those in wildtype CD45.1 C57BL/6 mice. Together, these data indicate that Ncr1<sup>iCre</sup> expression itself does not influence the phenotype we report in TGF-βRII<sup>Ncr1∆</sup> dams. These additional findings have been included in Supplementary Figure 1 and in the text as follows:

      “To ensure we exclude a confounding effect of Ncr1<sup>iCre</sup> expression, we profiled the uterine innate lymphoid compartment in pregnant Ncr1<sup>iCre</sup> dams at gestational day 6.5. No differences were observed in the absolute number of trNK cells, cNK cells, or ILC1s relative to wildtype controls (Figure S1 A-D), and implantation site number and resorption rates were likewise unchanged (Figure S1 E-F). These data indicate that Ncr1<sup>iCre</sup> expression alone does not perturb uterine ILC composition or early pregnancy outcomes.”

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 1C &D: The adoptive transfer experiment is convincing. As a minor point, why is the gate setting for Eomes different between panels 1C and 1D?

      To clarify the phenotype of the adoptively transferred cNK cells, we included two additional gates depicting the expression of CD49a and CD49b in unlabeled (non-vascular) trNK cells and cNK cells in the pregnant uterus. Please see the revised Figure 1C and revised figure legend:

      “(C) Concatenated flow plots of implantation sites showing that adoptively transferred cNK cells in the pregnant uterus of wildtype dams upregulate CD49a and downregulate CD49b by gd 10.5, acquiring a CD49a<sup>+</sup> CD49b<sup>-</sup> Eomes<sup>+</sup> phenotype characteristic of uterine trNK cells (C57BL/6 dams n=4). Here, 2.5x10<sup>6</sup> CD45.2<sup>+</sup> CD3<sup>-</sup> CD19<sup>-</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> CD49b<sup>+</sup> splenic cNK cells were adoptively transferred into pregnant C57BL/6-CD45.1 dams at gd 0.5, and the receptor profile of these cells was subsequently assessed at gd 10.5. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2–PE-Cy7<sup>-</sup> CD45.2–PE<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells.”

      (2) Figure 3: Has the pup ratio male/female changed?

      We did not observe a statistically significant difference in the female-to-male pup ratio between groups.

      Reviewer #2 (Public review):

      (1) The authors suggest cNK extravasation and local differentiation into iv- trNK. Can it be estimated how much this process contributes to the trNK pool vs. a potential local proliferation of already existing trNK? How do absolute numbers of CD49a+ Eomes+ trNK change during pregnancies? (In Figure 1A, the cell numbers of CD49a+ Eomes+ trNK seem to go down dramatically between gd 6.5 and 14.5). The plot in 1B could also include absolute numbers of ILC1s and trNKs. Would recruited cNK cells compensate for a potential loss of CD49a+ Eomes+ trNK?

      Our prior work, as well as that of others, has tracked the changes in uterine trNK cells, cNK cells, and ILC1s over the course of murine pregnancy. Consistent with these studies, the absolute number of uterine CD49a<sup>+</sup> Eomes<sup>+</sup> trNK cells peaks during early pregnancy (roughly between gds 5.5 and 7.5) and subsequently declines until term. The decrease in uterine trNK cells between gd 6.5 and gd 14.5 observed in Figure 1A is therefore consistent with the known physiological contraction of the decidual NK compartment as pregnancy progresses. Thus, it is unlikely that cNK cells recruited within the uterine tissue compensate for the loss of CD49a<sup>+</sup> Eomes<sup>+</sup> trNK cells observed. To address the reviewer’s request, we have now included the absolute number of uterine trNK cells and ILC1s in Figure 1; please see the updated Figure 1C and D and the corresponding figure legend (provided below). With respect to the relative contribution of cNK cell extravasation versus local proliferation of trNK cells, our data do not allow us to quantitatively distinguish between these mechanisms. Moreover, previous studies have demonstrated that uterine trNK cells express Ki67, suggesting that they exhibit proliferative activity during this period. Thus, we hypothesize that both local proliferation of existing trNK cells and recruitment of circulating cNK cells contribute to the population of uterine trNK cells during early pregnancy.

      “(C) Concatenated flow plots of implantation sites showing that adoptively transferred cNK cells in the pregnant uterus of wildtype dams upregulate CD49a and downregulate CD49b by gd 10.5, acquiring a CD49a<sup>+</sup> CD49b<sup>-</sup> Eomes<sup>+</sup> phenotype characteristic of uterine trNK cells (C57BL/6 dams n=4). Here, 2.5x10<sup>6</sup> CD45.2<sup>+</sup> CD3<sup>-</sup> CD19<sup>-</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> CD49b<sup>+</sup> splenic cNK cells were adoptively transferred into pregnant C57BL/6-CD45.1 dams at gd 0.5, and the receptor profile of these cells was subsequently assessed at gd 10.5. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2–PE-Cy7<sup>-</sup> CD45.2–PE<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells. (D) Proportion of uterine ILC subsets derived from adoptively transferred splenic cNK cells in the pregnant uterus of wildtype dams. Statistics were calculated using unpaired t tests with the Mann-Whitney correction. Error bars indicate SEM; ***p < 0.001.”

      Barahona, J.D., Yang, L. and Yokoyama, W.M., 2025. Eomesodermin defines uterine NK cells crucial for pregnancy success in mice. The Journal of Immunology, 214(10), pp.2549-2556.

      Filipovic, I., Chiossone, L., Vacca, P., Hamilton, R.S., Ingegnere, T., Doisne, J.M., Hawkes, D.A., Mingari, M.C., Sharkey, A.M., Moretta, L. and Colucci, F., 2018. Molecular definition of group 1 innate lymphoid cells in the mouse uterus. Nature Communications, 9(1), p.4492.

      (2) Figure 1C: 2.5 Mio cNK cells have been transferred, but only very few cells can be detected within the uterus (concatenated FACS plot shown). What may represent the limit to generate uterine trNK out of cNK? Is the niche supporting cNK-trNK differentiation limited? Is it only a specific subset of (splenic) cNK capable of differentiating into trNK? Is gd 0.5 the optimal timepoint for the transfer? Is there continuous recruitment of cNK into the uterus and differentiation into trNK, or is it enhanced at specific timepoints of pregnancy? Could there be local proliferation of cNK-derived trNK? This could be studied by proliferation dye dilution of WT cNK cells in this transfer-setup.

      We recognize that transferring cNK cells at gestational day 0.5, prior to placental formation, may partially account for the low uterine reconstitution observed. At this time point, the local signals necessary for efficient recruitment and retention of cNK cells in the uterus may not yet be fully established, potentially resulting in preferential homing to peripheral tissues such as the spleen and liver. Consistent with this possibility, we do observe a robust population of adoptively transferred cNK cells in the spleen and liver of our pregnant dams. We decided to transfer cNK cells at gestational day 0.5 to ensure that the cells were present throughout most of early pregnancy, particularly during implantation and the initial stages of decidualization. We also did not transfer cells before mating, to minimize the number of mice that did not become pregnant. Additionally, performing the transfer at this early time point minimized repeated manipulation of pregnant dams, as procedural stress itself has been shown to affect physiological processes of gestation and could thereby confound the pregnancy outcomes we were assessing. Furthermore, Filipovic et al. (2018) previously showed that both trNK cells and cNK cells in the pregnant uterus expressed Ki67 at gestational day 9.5, suggesting that there could be local proliferation of cNK-derived trNK cells in the gravid uterus that could limit the migration of circulating cNK cells into this microenvironment. We have discussed this in more depth in our Discussion section as follows:

      “Interestingly, the inability to fully reconstitute the uterine trNK cell compartment following adoptive transfer suggests that only a subset of circulating cNK cells may be capable of differentiating into trNK cells during pregnancy, or alternatively that trNK cells already present in the virgin uterus may undergo in situ proliferation in the gravid uterus. Previous studies from our lab as well as others show that trNK cells within the pregnant murine uterus express marked levels of Ki67, supporting a model in which local proliferation of uterine trNK cells is a major contributor to the uterine trNK cell pool during pregnancy [7,32]. Prior studies have also described hematopoietic precursors within endometrial and decidual tissues that generate uterine trNK cells, suggesting that the compartment may also be sustained by local precursor differentiation [33-35]. Together, these findings suggest that uterine trNK cell ontogeny may be more complex than a single-source model and raise the possibility that distinct developmental pathways may operate at different stages of reproductive life. Therefore, defining the relative contribution and developmental timing of hematogenous versus locally maintained sources in vivo could provide relevant insights into the developmental trajectories and transcriptional programs that underlie decidual NK cell heterogeneity.”

      Zhai, Q.Y., Wang, J.J., Tian, Y., Liu, X. and Song, Z., 2020. Review of psychological stress on oocyte and early embryonic development in female mice. Reproductive Biology and Endocrinology, 18(1), p.101.

      Wiebold, J.L., Stanfield, P.H., Becker, W.C. and Hillers, J.K., 1986. The effect of restraint stress in early pregnancy in mice. Reproduction, 78(1), pp.185-192.

      Sánchez-Rubio, M., Abarzúa-Catalán, L., Del Valle, A., Méndez-Ruette, M., Salazar, N., Sigala, J., Sandoval, S., Godoy, M.I., Luarte, A., Monteiro, L.J. and Romero, R., 2024. Maternal stress during pregnancy alters circulating small extracellular vesicles and enhances their targeting to the placenta and fetus. Biological Research, 57(1), p.70.

      Filipovic, I., Chiossone, L., Vacca, P., Hamilton, R.S., Ingegnere, T., Doisne, J.M., Hawkes, D.A., Mingari, M.C., Sharkey, A.M., Moretta, L. and Colucci, F., 2018. Molecular definition of group 1 innate lymphoid cells in the mouse uterus. Nature Communications, 9(1), p.4492.

      (3) The authors should consider inducible Tgfbr2 deletion (e.g. with Tamoxifen-inducible Cre) to enable development of the uterine NK compartment in virgin mice and only ablate trNK differentiation during pregnancy. This could help to estimate the turnover of cNK into trNK, or to understand if constant cNK recruitment is required to form the uterine trNK compartment during pregnancy.

      Thank you for this suggestion. We did initially consider incorporating a mouse model with a tamoxifen-inducible deletion of the TGF-βRII to examine the differentiation of peripheral cNK cells into uterine trNK cells more precisely. However, the administration of tamoxifen during murine pregnancy has well-established deleterious effects on implantation, fetal viability, and placentation, which would confound our interpretations of any adverse pregnancy outcome observed in our studies. Because our goal was to assess NK cell-specific contributions to murine gestation without introducing additional pregnancy-related perturbations, we elected to use an Ncr1<sup>iCre</sup>-based mouse model in our studies.

      Ved, N., Curran, A., Ashcroft, F.M. and Sparrow, D.B., 2019. Tamoxifen administration in pregnant mice can be deleterious to both mother and embryo. Laboratory animals, 53(6), pp.630-633.

      Sun, M.R., Steward, A.C., Sweet, E.A., Martin, A.A. and Lipinski, R.J., 2021. Developmental malformations resulting from high-dose maternal tamoxifen exposure in the mouse. PLoS One, 16(8), p.e0256299.

      Ilchuk, L.A., Stavskaya, N.I., Varlamova, E.A., Khamidullina, A.I., Tatarskiy, V.V., Mogila, V.A., Kolbutova, K.B., Bogdan, S.A., Sheremetov, A.M., Baulin, A.N. and Filatova, I.A., 2022. Limitations of tamoxifen application for in vivo genome editing using Cre/ERT2 system. International Journal of Molecular Sciences, 23(22), p.14077.

      (4) Did the authors consider transfer of Tgfbr2-floxed Ncr1-Cre cNK in the same setup as in Fig. 1C? This experiment could confirm the requirement of Tgfbr-dependent signaling for cNK to trNK conversion during pregnancy versus effects of Tgfb signals on trNK numbers in the uterus at steady state (before pregnancy).

      We thank the reviewer for this mechanistically insightful suggestion. We did consider performing reciprocal transfer experiments using TGF-βRII<sup>fl/fl</sup> Ncr1<sup>iCre</sup> cNK cells in the same adoptive transfer system as in Figure 1C. However, our current adoptive transfer experiments already address this question directly. Transfer of congenically labeled wild-type splenic cNK cells into TGF-βRII<sup>Ncr1Δ</sup> dams at gestational day 0.5 resulted in partial reconstitution of the uterine trNK compartment and, importantly, this was sufficient to rescue the adverse pregnancy outcomes observed at midgestation. These findings indicate that TGF-β–competent cNK cells can differentiate and function appropriately within the pregnant uterine environment, supporting a requirement for TGF-β–dependent signaling in cNK-to-trNK conversion during pregnancy. Because restoration of TGF-β–sufficient cNK cells rescues these pregnancy outcomes, we believe this experiment functionally demonstrates the importance of TGF-β signaling in this process and therefore did not pursue reciprocal transfer of TGF-βRII–deficient cNK cells.

      “Partial reconstitution of uterine trNK cells restores midgestational pregnancy outcomes in TGF-βRII<sup>Ncr1∆</sup> dams

      To determine whether restoring uterine trNK cells could rescue the midgestational pregnancy defects observed in TGF-βRII<sup>Ncr1∆</sup> dams, we adoptively transferred wildtype, congenically labeled splenic cNK cells into pregnant TGF-βRII<sup>Ncr1∆</sup> dams at gd 0.5. By gd 10.5, donor cNK cells were detected in the pregnant uterus, where a subset upregulated CD49a and downregulated CD49b, consistent with acquisition of a uterine trNK cell phenotype (Figure 5 A). However, adoptively transferred splenic cNK cells only partially reconstituted the uterine trNK cell population in the gravid uterus of TGF-βRII<sup>Ncr1∆</sup> dams, as evidenced by reduced absolute number and frequency of donor-derived trNK cells in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams (Figure 5 A-C). Notably, this partial reconstitution was sufficient to rescue the gestational defects caused by impaired TGF-β–mediated uterine trNK cell differentiation. Reconstituted TGF-βRII<sup>Ncr1∆</sup> dams exhibited implantation site numbers and fetal resorption rates at gd 10.5 comparable to those observed in littermate controls (Figure 5 D, E). Together, these findings suggest that even partial restoration of the uterine trNK cell compartment in pregnant TGF-βRII<sup>Ncr1∆</sup> dams is sufficient to restore pregnancy outcomes at midgestation, supporting a central role for uterine trNK cells as the principal NK cell subset required for successful murine pregnancy.”

      (5) Figures 2D/E: The authors should state that ILC1s are reduced in the virgin uterus of female Tgfbr2-floxed or Tgfb1-floxed Ncr1-Cre mice and cite the relevant work (the Ref #29 discussed in this context did not show that?). It would be helpful to include an analysis of all three uterine ILC subsets in steady state. This could help to answer the question if the cNK cell changes are pregnancy-specific or a general phenomenon in Tgfbr2-floxed Ncr1-Cre mice.

      We thank the reviewer for this important comment and for noting the miscitation. We regret the error and have corrected the reference in the revised manuscript to cite the appropriate study demonstrating reduced ILC1s in the virgin uterus of Tgfb1<sup>fl/fl</sup> Ncr1<sup>iCre</sup> mice {Sparano, C. et al. 2024. Autocrine TGF-β1 drives tissue-specific differentiation and function of resident NK cells. Journal of Experimental Medicine, 222(3), p.e20240930}. Please see Line 148. Importantly, the steady-state ILC compartment in virgin Tgfb1<sup>fl/fl</sup> Ncr1<sup>iCre</sup> mice has already been carefully characterized in the previously published work, including analysis of all three uterine ILC subsets. Because the steady-state uterine ILC landscape in this mouse model has already been established by Sparano, C. et al. 2024, our study focuses specifically on the pregnancy-associated changes in the uterine ILC landscape occurring in the absence of TGF-β signaling in Ncr1-expressing cells and their subsequent effects on gestational outcomes. In the absence of TGF-β signaling there appears to be a higher frequency of cNK cells in both the virgin uterus and pregnant uterus, suggesting that this is more of a general phenomenon.

      “However, in the pregnant uterus, CD49a<sup>+</sup> Eomes<sup>-</sup> ILC1s were markedly reduced in implantation sites of TGF-βRII<sup>Ncr1∆</sup> dams, paralleling the reduction of ILC1s previously reported in the virgin uterus of TGF-βRII<sup>Ncr1∆</sup> female mice [26].”

      (6) Figure 2E: Please phrase more carefully about the "concomitant increase" of cNKs, since this increase is much less pronounced compared to the very strong reduction (absence) of trNKs in Tgfbr2-floxed Ncr1-Cre mice. Do the authors suggest that cNKs are halted at this stage and cannot differentiate into trNK, based on these data?

      We thank both reviewers for this suggestion, and we have rephrased our wording to address this concern as follows:

      “This reduction of uterine trNK cells was accompanied by a small increase in the absolute number and frequency of CD49b<sup>+</sup> Eomes<sup>+</sup> cNK cells within the pregnant uterus of TGF-βRII<sup>Ncr1∆</sup> dams (Figure 2 D, E). Collectively, these findings suggest that a TGF-β–driven differentiation pathway directs the conversion of peripheral cNK cells into uterine trNK cells during murine pregnancy.”

      Please also see our response to Reviewer #1, Comment #2.

      (7) Can the reduced litter size and the abnormal spiral artery formation be rescued by transfer of WT cNK into Tgfbr2-floxed Ncr1-Cre mice?

      We thank the reviewers for this interesting question. In subsequent experiments, we transferred congenically labeled splenic cNK cells from wildtype female mice into TGF-βRII<sup>Ncr1∆</sup> dams at gestational day 0.5. We observed only partial reconstitution of the uterine trNK cell population; however, the number of viable implantation sites and the resorption rates in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams were comparable to those in HBSS-treated littermate controls at gestational day 10.5. Given that partial reconstitution of the uterine trNK cell compartment in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams was sufficient to rescue the defects in implantation site number and fetal resorption rates observed at midgestation, we hypothesize that this level of restoration may permit partial but functionally sufficient spiral artery remodeling to reestablish maternal-fetal blood flow adequate to support fetal viability, although spiral artery remodeling was not directly assessed in this transfer study.

      “Partial reconstitution of uterine trNK cells restores midgestational pregnancy outcomes in TGF-βRII<sup>Ncr1∆</sup> dams

      To determine whether restoring uterine trNK cells could rescue the midgestational pregnancy defects observed in TGF-βRII<sup>Ncr1∆</sup> dams, we adoptively transferred wildtype, congenically labeled splenic cNK cells into pregnant TGF-βRII<sup>Ncr1∆</sup> dams at gd 0.5. By gd 10.5, donor cNK cells were detected in the pregnant uterus, where a subset upregulated CD49a and downregulated CD49b, consistent with acquisition of a uterine trNK cell phenotype (Figure 5 A). However, adoptively transferred splenic cNK cells only partially reconstituted the uterine trNK cell population in the gravid uterus of TGF-βRII<sup>Ncr1∆</sup> dams, as evidenced by reduced absolute number and frequency of donor-derived trNK cells in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams (Figure 5 A-C). Notably, this partial reconstitution was sufficient to rescue the gestational defects caused by impaired TGF-β–mediated uterine trNK cell differentiation. Reconstituted TGF-βRII<sup>Ncr1∆</sup> dams exhibited implantation site numbers and fetal resorption rates at gd 10.5 comparable to those observed in littermate controls (Figure 5 D, E). Together, these findings suggest that even partial restoration of the uterine trNK cell compartment in pregnant TGF-βRII<sup>Ncr1∆</sup> dams is sufficient to restore pregnancy outcomes at midgestation, supporting a central role for uterine trNK cells as the principal NK cell subset required for successful murine pregnancy.”

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1C: The shown gate seems to "cut" into the CD49b staining; staining for all transferred cells should be shown; have cNK cells been stained in parallel with the same panel to provide a positive and compensation control?

      To clarify the phenotype of the adoptively transferred cNK cells, we included two additional gates depicting the expression of CD49a and CD49b in unlabeled (non-vascular) trNK cells and cNK cells in the pregnant uterus. Please see the revised Figure 1C.

      “(C) Concatenated flow plots of implantation sites showing that adoptively transferred cNK cells in the pregnant uterus of wildtype dams upregulate CD49a and downregulate CD49b by gd 10.5, acquiring a CD49a<sup>+</sup> CD49b<sup>-</sup> Eomes<sup>+</sup> phenotype characteristic of uterine trNK cells (C57BL/6 dams n=4). Here, 2.5x10<sup>6</sup> CD45.2<sup>+</sup> CD3<sup>-</sup> CD19<sup>-</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> CD49b<sup>+</sup> splenic cNK cells were adoptively transferred into pregnant C57BL/6-CD45.1 dams at gd 0.5, and the receptor profile of these cells was subsequently assessed at gd 10.5. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2–PE-Cy7<sup>-</sup> CD45.2–PE<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells.”

      (2) Figure 2A: The authors could include an isotype control or a staining in a genetic knockout as a control staining.

      Thank you for this suggestion. As suggested, we included staining in a genetic TGF-βRII<sup>Ncr1∆</sup> knockout as additional control staining. Please see the revised Figure 2A.

      “Representative histograms depicting TGF-β Receptor II expression on splenic NK cells from virgin TGF-βRII<sup>Ncr1∆</sup> and wildtype mice as well as splenic and uterine NK cell subsets from pregnant wildtype mice at gd 10.5 (virgin TGF-βRII<sup>Ncr1∆</sup> mice, n=2; virgin mice: C57BL/6, n=5; gd 10.5: C57BL/6 dams, n=8, implantation sites n=8). MFI, median fluorescent intensity. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells.”

    1. eLife Assessment

      This manuscript provides important insights into how U2AF2-dependent intron retention regulates the localization and function of long noncoding RNAs, with evidence supported by multiple complementary approaches. The work is notable for linking intron retention to nuclear speckle localization and cellular phenotypes, including proliferation and migration, although the mechanistic basis remains incompletely resolved. Overall, the study presents a compelling dataset with clear biological implications but would benefit from additional analyses to strengthen mechanistic interpretation and generality.

    2. Reviewer #1 (Public review):

      Intron retention is observed in many long noncoding RNAs. The authors here used a powerful genome-wide screening strategy to identify proteins controlling intron retention in the long noncoding RNA PURPL. Surprisingly, one of the top hits across multiple cell lines was U2AF2, which is well known to bind the polypyrimidine tract close to the 3' splice site to promote splicing. Nonetheless, U2AF2 is working in the opposite direction here. Convincing follow-up RT-PCR experiments confirmed that knocking down U2AF2 does indeed lead to reduced intron retention of PURPL. The authors then show that this intron retention event is functionally important for both the nuclear retention of PURPL and its ability to enhance cell proliferation.

      The authors then used transcriptome-wide analyses to look for additional intron retention events affected by U2AF2. Among the ~250 genes with decreased intron retention (more splicing) upon U2AF2 knockdown was MALAT1, a well-established long noncoding RNA that normally localizes to nuclear speckles. Depletion of U2AF2 or removal of the MALAT1 2nd intron resulted in reduced speckle localization and cell migration, revealing a critical and fascinating role for this intron retention event. Overall, the authors have used a set of complementary approaches to clearly demonstrate a very intriguing role for U2AF2 in controlling intron retention and functionality of a set of long noncoding RNAs.

      I feel the current work has revealed an important role of intron retention in controlling the localization and functionality of long noncoding RNAs, which is likely broad in scope and is likely regulated by cell state.

      One experimental suggestion: The authors show that expressing intron 2-containing PURPL in PURPL-depleted cells is sufficient to induce faster proliferation, but a valuable comparison would be the phenotype of cells expressing the spliced PURPL transcript.

    3. Reviewer #2 (Public review):

      Summary:

      This study identified U2AF1/2 as a regulator of pre-mRNA splicing that either promotes or suppresses the splicing of introns in different genes. The authors then focused on two genes, PURPL and MALAT1, in which U2AF1/2 promotes the retention of specific introns, and characterized the biological implications of these U2AF1/2-regulated introns.

      Strengths:

      (1) The experiments in this manuscript are relatively rigorously designed and performed, often with validation checks such as verifying the knockout, verifying the treatment itself doesn't have an effect, etc.

      (2) The experiments provided comprehensive support for the claims that these specific introns are important for the stability or nuclear localization of the RNA, as well as that U2AF1/2 suppresses the splicing of these introns.

      (3) The writing of the manuscript is very clear and doesn't overstate the conclusions that can be drawn from the experiments.

      Weaknesses:

      I think one main weakness of this study is the lack of a deeper analysis of the mechanisms. Whether studying the mechanism is within the scope of this paper is probably debatable, but with the current experiment setup and data, I believe there are some analyses that can be relatively easily done to enhance the value or significance of this study. My detailed questions and suggestions are listed below:

      (1) Lines 194-195 and Figure 2A: How many RBPs are included in "other RBPs" in line 194? Does "other RBPs" only include PTBP1, PRPF8 and SRSF1 in Figure 2A, or does it include all the ~100 RBPs with HepG2 eCLIP data available on ENCODE? If U2AF1/2 have the highest occupancy around the intron 2 region among the ~100 RBPs, it would be nice to visualize it.

      (2) Figure 2A and 2B: Why did U2AF2 show no interaction with exons 2 and 3 in RNA-IP but show enrichment over the exon 2 and exon 3 regions in the eCLIP data?

      (3) Figure 3C - 3F: Maybe I misinterpreted the experiments, but to my understanding, these experiments showed that the exogenous PURPL with intron 2 promoted cell proliferation compared to when the exogenous PURPL wasn't induced, but didn't compare this to the effect of the same amount of PURPL with intron 2 removed. Wouldn't it be clearer to compare the effects of exogenous PURPL with intron 2 and exogenous PURPL without intron 2 to pinpoint whether the effect is related to intron 2? Without an intron 2-specific experiment, these current experiments don't seem to provide much added value beyond "PURPL promotes cell proliferation".

      (4) It's not very clear what proportion of these introns are retained in the endogenous PURPL and MALAT1 in various tissues, cell types and conditions. I think it will be valuable to provide this background (either from previous research, public database or data from this study).

      (5) Since U2AF1/2 have a wide range of targets as demonstrated by Figure 4A, I think it would be valuable to have some experiments that directly disrupt the interaction between U2AF1/2 and PURPL and MALAT1 and test the effect on splicing outcomes, such as by mutating the sequence that U2AF1/2 bind to. The section on the weak py-tract of PURPL touched upon this topic but focused more on how the weak py-tract causes the intron 2 retention in the background rather than how U2AF1/2 binding and action were affected by sequence mutations. I think experiments that disrupt the direct binding of U2AF1/2 to their targets can provide valuable mechanistic insights.

      (6) Across all the target genes of U2AF1/2, it might be feasible to do some systematic analysis to find what correlates with whether U2AF1/2 have a promoting or suppressing effect on intron splicing. For example, do genes with decreased IR after U2AF2 depletion systematically have a weak py-tract compared to genes with increased IR? This dataset can potentially provide many hypotheses for understanding the dual role of U2AF1/2.

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript characterized the splicing regulation of two long non-coding RNAs relevant to cancer, starting with a focus on PURPL and ending with insights into MALAT1. A CRISPR screen for the regulators of PURPL intron retention revealed a role for the U2AF heterodimer in inducing this retention, with U2AF2 as the actual hit. This is surprising, because the canonical function of U2AF is to recognize the polypyrimidine tract (PPT) and 3' splice site junction to induce splicing at the site. The brief mechanistic characterization of this phenomenon showed that this intron retention accounts for the nuclear localization and instability of the PURPL transcript, and seems to confer the enhanced cell proliferation feature. U2AF2 also induces retention of two introns in MALAT1, and one of them is essential for its nuclear speckle localization and enhanced cell migration.

      Strengths:

      These findings about PURPL and MALAT1 are clear and interesting.

      Weaknesses:

      The results are not sufficiently connected to each other, because one regulation is nuclear-speckle dependent but not the other.

      Here are my specific comments:

      Major comments:

      The main issue is the lack of focus because of the distinct and incomplete analysis pertaining to the two long noncoding RNAs, PURPL and MALAT1. The paper starts with a very good genetic screen on the former, and immunofluorescence and functional analysis on the latter, with U2AF2 as the main link to induce intron retention. The first one does not show clear localization while the second docks to nuclear speckles, apparently because of the retained intron. Hence the two mechanisms are related yet distinct. Here are some suggestions to enhance the characterization and connection between the two cases:

      (1) Because retention of MALAT1 intron 2, but not of the retained PURPL intron, contributes to speckle localization, the retained introns or their 3' splice site sequences should be swapped to see if they determine the localization.

      (2) Figure 3: the rescue of the PURPL knockout by the intron-retained RNA to induce proliferation is a powerful experiment, but it lacks the rescue with the RNA without the intron as a control. This must be done and shown.

      (3) The weakness of the PPT of PURPL intron 2 appears as a clear feature of its retention dependent on U2AF2, which appears direct, as backed by CLIP data. It would be good to show direct binding by EMSA or equivalent techniques. Furthermore, the data is also consistent with other determinants. The exon and upstream intronic sequences, including the branch point, could also be involved, so mutations in these are also required.

      (4) In brief, what are the commonalities and differences between PURPL and MALAT1 with regard to their U2AF2-dependent intron retention?

    1. eLife Assessment

      This important study establishes the first vertebrate models of DeSanto-Shinawi Syndrome, revealing conserved craniofacial and social and behavioral phenotypes across mouse and zebrafish that mirror key clinical features. The convincing evidence is supported by behavioral, anatomical, and molecular analyses of Wac animal mutants. This study sets a baseline for future mechanistic studies and reports a platform to test approaches to reverse phenotypes.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      The authors generated mouse and zebrafish models for DeSanto-Shinawi Syndrome, caused by loss-of-function variants in the WAC gene. Using these vertebrate systems, they demonstrate conserved craniofacial and social-behavioral phenotypes that parallel human clinical features, along with deficits in GABAergic markers. They observe increased seizure susceptibility and male-biased brain volumetric changes in Wac mutant mice. Together, these findings begin to define the biological consequences of Wac haploinsufficiency and provide valuable resources for future mechanistic studies.

      Strengths:

      WAC is a high-confidence neurodevelopmental disorder gene and one of the genes identified by large-scale exome sequencing efforts, including the Satterstrom et al. (2020) autism spectrum disorder cohort. This study establishes the first vertebrate Wac models, addressing a major gap in the understanding of DeSanto-Shinawi Syndrome, and provides a framework for studying other syndromic forms of autism. The models generated will be impactful and useful to the community to study and understand DeSanto-Shinawi Syndrome.

      The cross-species analysis is important and well executed, and reveals both conserved and divergent phenotypes. The behavioral and anatomical assays are rigorously executed and well-controlled, and the inclusion of RNA-sequencing analyses adds valuable insights into the mechanisms underlying brain function in Wac mutants. Notably, the RNA-seq data reveal upregulation of several clustered protocadherins, genes central to neuronal identity and cell-cell interactions, which are known to be regulated by dynamic developmental regulation of chromatin architecture. This observation provides an intriguing hint that could link Wac function to higher-order chromatin organization and neuronal connectivity.

      Weaknesses:

      The evidence is solid, though the study remains incomplete in its mechanistic depth and molecular interpretation. The authors compellingly describe behavioral, anatomical, and transcriptomic phenotypes associated with WAC loss, yet do not explore how WAC mechanistically regulates chromatin or transcription. Given prior evidence that WAC interacts with the RNF20/40 ubiquitin ligase complex and promotes histone H2B ubiquitination and transcriptional elongation, the paper would benefit from a discussion of these functions as a potential link between Wac haploinsufficiency and the observed changes in neuronal gene expression. Similarly, the authors mention WAC's WW and coiled-coil domains but do not consider how these domains could mediate nuclear interactions or recruitment of transcriptional cofactors that shape gene regulation and chromatin organization in neurons.

      The transcriptomic analysis is rich but largely descriptive. Although the upregulation of clustered protocadherins is particularly intriguing, these findings are not validated or localized to specific neuronal populations. The study would be strengthened by independently validating the most significant RNA-seq changes, such as protocadherin gamma genes, using in situ hybridization methods to confirm the spatial and cellular specificity of expression changes.

    3. Reviewer #2 (Public review):

      The authors describe the first deep neurological characterization of WAC mutation in two vertebrate species (zebrafish and mouse). They examine these at various levels, guided by the work in humans that has associated a heterozygous WAC mutation with DeSanto-Shinawi Syndrome (DESSH). Therefore, they investigate the animals for a variety of phenotypes, following a template for what is seen when characterizing a new mouse/fish model of a developmental disability gene. Investigations include analysis of skull and jaw for abnormalities (both species), MRI of brain structure (in mice), electrophysiology (mice), assessment of signaling pathways (by Western blot, in mice), cell counts (both, more in mice), transcriptomics (mice), and behavior (both).

      Generally, this describes an important first characterization of the consequences of the mutation. Most of the studies appear well-conducted and reasonably powered, thus solid or convincing.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors generated mouse and zebrafish models for DeSanto-Shinawi Syndrome, caused by loss-of-function variants in the WAC gene. Using these vertebrate systems, they demonstrate conserved craniofacial and social-behavioral phenotypes that parallel human clinical features, along with deficits in GABAergic markers. They observe increased seizure susceptibility and male-biased brain volumetric changes in Wac mutant mice. Together, these findings begin to define the biological consequences of Wac haploinsufficiency and provide valuable resources for future mechanistic studies.

      Strengths:

      WAC is a high-confidence neurodevelopmental disorder gene and one of the genes identified by large-scale exome sequencing efforts, including the Satterstrom et al. (2020) autism spectrum disorder cohort. This study establishes the first vertebrate Wac models, addressing a major gap in the understanding of DeSanto-Shinawi Syndrome, and provides a framework for studying other syndromic forms of autism. The models generated will be impactful and useful to the community to study and understand DeSanto-Shinawi Syndrome.

      The cross-species analysis is important and well executed, and reveals both conserved and divergent phenotypes. The behavioral and anatomical assays are rigorously executed and well-controlled, and the inclusion of RNA-sequencing analyses adds valuable insights into the mechanisms underlying brain function in Wac mutants. Notably, the RNA-seq data reveal upregulation of several clustered protocadherins, genes central to neuronal identity and cell-cell interactions, which are known to be regulated by dynamic developmental regulation of chromatin architecture. This observation provides an intriguing hint that could link Wac function to higher-order chromatin organization and neuronal connectivity.

      Weaknesses:

      The evidence is solid, but the study remains incomplete in its mechanistic depth and molecular interpretation. The authors compellingly describe behavioral, anatomical, and transcriptomic phenotypes associated with WAC loss, yet do not explore how WAC mechanistically regulates chromatin or transcription. Given prior evidence that WAC interacts with the RNF20/40 ubiquitin ligase complex and promotes histone H2B ubiquitination and transcriptional elongation, the paper would benefit from a discussion of these functions as a potential link between Wac haploinsufficiency and the observed changes in neuronal gene expression. Similarly, the authors mention WAC's WW and coiled-coil domains but do not consider how these domains could mediate nuclear interactions or recruitment of transcriptional cofactors that shape gene regulation and chromatin organization in neurons.

      We agree that many of the mechanisms underlying how the Wac gene causes both the animal model phenotypes and the human symptoms still need to be worked out. Because a great deal of data had to be generated to first describe these models in this manuscript, these mechanisms will be expanded upon later; we plan to follow up with mechanistic papers to fully address the gap that remains. We have now added a paragraph in the discussion to bring up these important points regarding the roles of Wac during transcription and how its protein domains might be involved in these processes.

      The transcriptomic analysis is rich but largely descriptive. Although the upregulation of clustered protocadherins is particularly intriguing, these findings are not validated or localized to specific neuronal populations. The study would be strengthened by independently validating the most significant RNA-seq changes, such as protocadherin gamma genes, using in situ hybridization methods to confirm the spatial and cellular specificity of expression changes.

      We have greatly expanded the analyses of the bulk RNA-seq data, including a more rigorous look into the differences in gene expression between sexes, which has additionally revealed males to be more impacted by Wac loss of function. We have also added new western blot data for pan protocadherin alpha, which is now validated to be upregulated in the cortex (new Figure 7I and 7J). We are holding back any additional data from this report as we have single nucleus RNA-seq data that will be reported on in follow-up papers with targeted conditional deletion models.

      Finally, while the behavioral and MRI results add valuable breadth, their interpretation would be improved by clearer reporting of sample sizes, statistical corrections, and effect sizes to support claims of sex-specific and regional brain volume differences.

      Some additional details have been added to the methods section. In addition, we have now provided sample sizes assessed in each figure legend.

      Reviewer #2 (Public review):

      The authors describe the first deep neurological characterization of WAC mutation in two vertebrate species (zebrafish and mouse). They examine these at various levels, guided by the work in humans that has associated a heterozygous WAC mutation with DeSanto-Shinawi Syndrome (DESSH). Therefore, they investigate the animals for a variety of phenotypes, following a template for what is seen when characterizing a new mouse/fish model of a developmental disability gene. Investigations include analysis of skull and jaw for abnormalities (both species), MRI of brain structure (in mice), electrophysiology (mice), assessment of signaling pathways (by Western blot, in mice), cell counts (both, more in mice), transcriptomics (mice), and behavior (both).

      Generally, this describes an important first characterization of the consequences of the mutation. Most of the studies appear well-conducted and reasonably powered, thus solid or convincing. However, there are a few places where the data presentation could be improved for clarity, and a few concerns about some choices in analytical approach for a couple of the experiments, where improved statistical approaches could improve their sensitivity and/or better rule out false positives, and thus the support of some of these claims is currently incomplete. There is also some lack of clarity about the rationale for some decisions regarding the fish genetics. Nonetheless, this is an important and useful first characterization of many phenotypes of these lines. Such experiments form a baseline for future mechanistic studies in the same lines and a platform to test approaches to reverse phenotypes.

      Individual claims and their strength & weaknesses:

      (1) The authors developed mouse and zebrafish models of WAC deletion

      They used the existing KOMP floxed WAC line to generate a null allele. For the mouse, there is a Western showing that it is indeed null for the protein. The fish data is less robustly validated - they don't confirm that the allele is null at the protein or RNA level, and fish have two paralogs (waca and wacb), and this paper only characterizes one of these. So this evidence is less clear. The evaluated mice are heterozygous (Het), similar to patients, while the fish appear to be evaluated as homozygous mutants.

      We agree with the reviewer’s comments on zebrafish genetics. Since antibodies against zebrafish Wac proteins are not available, we could not examine protein levels in zebrafish. We predicted frameshift mutations due to DNA analyses in waca and wacb KO zebrafish. We made waca KO, wacb KO, and waca/wacb double KO zebrafish. waca/wacb double KO zebrafish showed a lethal phenotype, similar to homozygous mice mutants. Since wacb KO zebrafish did not show any detectable phenotype we do not report those here. However, we now show examples of the wacb and dKO zebrafish in Figure S1. Since waca KO zebrafish showed craniofacial and behavioral phenotypes that are comparable to mice Het and human patients, they are focused on in this report.

      (2) The authors show that both species show altered craniofacial features

      These data appear well powered, and the findings are robust.

      We appreciate this confirmation.

      (3) Each model altered GABAergic neurons

      In mice, the authors stained with PV antibodies and saw a decrease in cells positive for this staining. A second marker, Lhx6, does not show a difference, suggesting this might be a change in PV expression rather than cell number. They could maybe look into the literature to see if this loss of just the protein also occurs in other models. Overall, the sample size here is a bit smaller than other parts of the paper (n=3), and the methods on the cell counts were less clear, so it is not as clear that this finding is as robust. The authors counted several other broad classes of cells, and those appear normal. Interestingly, there might also be some TBR1 mislocalization in layer 6 that might be significant with added power.

      Thank you for these suggestions. Yes, other models also show this lack of PV expression even when MGE-lineage interneurons are present at normal levels. We mention in the discussion a previous study on the ASD gene CNTNAP2 that showed this. We also agree that there is a trend in the Tbr1 population. We assessed another WT and Het pair for Tbr1 laminar distribution and were able to determine that these changes held up and are now significantly different; the person counting these numbers was blind to the genotypes. Finally, we added more details to the methods to describe how the counting was performed.

      The fish data is based on an in situ hybridization for GAD. The measure shown is the width of the positive area in the forebrain. This measure is not one I have seen much before, and has potential to be driven by something unrelated to GABA (e.g., if the whole forebrain were simply a bit smaller). So this analysis could use a couple of other approaches (density of signal?) and/or a control probe for some other brain gene showing the measure is normal, and thus it is not just a size issue.

      To compare altered GABAergic neurons in mice and zebrafish, we tried to isolate zebrafish PV genes and examined their expression by whole-mount in situ hybridization (now included in Figure S3), but found no differences. However, we could not find any zebrafish PV gene useful as a marker for GABAergic neurons. We therefore chose to examine gad1b expression in the forebrain of WT and waca KO zebrafish and found differences in the brain area with gad1b expression. Since WT and waca KO brain sizes are generally the same, we believe this measurement is reasonable for drawing this conclusion and have added text to the results section to justify it.

      (4) Mice were more susceptible to the seizure-inducing agent PTZ

      These data appear well powered, and the findings are robust. The authors also did a fair amount of useful electrophysiology that was all normal, but appeared to be well executed.

      Thank you, we appreciate this confirmation.

      (5) Mice had changes in brain volume that interact with sex

      The authors conducted an MRI on a good number of mice and reported a slight increase in global volume just in males. Sample size is fair, but the statistical approach here may be better if it puts males and females in the same model (to boost power and explicitly test for sex by genotype interaction that they report), and there is some chance that the brain region level differences that they report could include some false positives. They tested many regions, and it is not clear whether or not they corrected for the number of tests. Often, an FDR correction would be used in such imaging studies. It may be that only the most robust regional findings will survive those corrections. It is interesting data either way, but the analysis could be improved.

      Given the 80 regions (bilaterally) that we used and the number of mice, i.e. 6-7, we are underpowered to robustly undertake FDR-type corrections. In the data presented we used t-tests between sex and regions to illuminate putative regional changes. However, we did revisit our MRI data and found three data sets where the results were not normally distributed. We thus changed our statistical test to the Mann-Whitney test for male retrosplenial cortex, male parietal cortex and female corpus callosum; these changes are now reflected in the figures, and the different statistical tests are noted in the figure legends.

      (6) Several behaviors are altered in the mice as well

      These studies were fairly well-powered (n=15,16), and they found several positive and negative results, including alterations in memory and sociability in both species. There is a minor statistical flaw in the three-chamber analysis (they don't actually compare the Hets directly to the wildtypes in their statistical testing - a common mistake in neuroscience that should be addressed). But the data look like they will probably still be significant when correctly analyzed. In the supplement, the authors could do a bit more with the data they have to look at hyperactivity (i.e., show total motion in open field, not just time in center vs. periphery), and adding sex to their model might improve sensitivity for genotype effects.

      Thank you for these suggestions. We have done several things to address this behavioral paradigm. First, we increased the sample size and switched from comparing mouse vs. object to comparing genotypes directly. In addition, we switched to quantifying a discrimination index, described in Phillips et al., 2019 (PMID: 31112129), for our measurement. These new data are shown in Figure 3A. Open field total distance traveled has now been added to Figure S2A. For all other measurements, we did first assess for sex differences but found none and thus pooled both sexes for the graphs.

      (7) Some biochemical signaling pathways are altered in the brain

      These are n=4 immunoblots, and show altered phospho ERK, but no changes in other signaling events predicted from prior WAC literature like H2B ubiquitination. They appear well done, and the authors share the full blots in the supplement.

      Thank you, we appreciate this confirmation. Since Wac is an adaptor protein we needed to test these reported molecular changes in neurons that were previously only reported in cell lines and drosophila. We were not surprised that some of these previously reported changes would not be the same in brain cells. However, it is possible that these changes might arise in more discrete brain regions or at different times during development, which will be tested in our future conditional knockout models.

      (8) WAC deletion also alters gene expression in the brain

      These studies were well-powered for RNAseq, with 10 and 14 samples, using neonates (P2), just the forebrain. The sequencing quality metrics all looked good, and the approach to analysis was okay. It would be stronger to again include sex in the model, rather than separate by sex. There were some typos in this part of the paper that made part of the conclusions unclear, but the RNAseq nicely confirmed the mutation of the mice, and discovered many differentially expressed genes, consistent with the role of this gene as a regulator of transcription. The presentation could be expanded to make more use of the data. Overall, though, this is a useful first characterization of the transcriptome in the line.

      Thank you for the suggestions. We have greatly expanded our assessments of the RNA-seq data. Upon analyzing the data we found many differences between males and females and now show combined and sex-separated data. Our new data isolate several more extreme and some unique changes in males that are better shown as standalone figure panels. In addition to these edits, we have also reworked all the text in this section of the results for better readability.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The cause and timing of lethality in the homozygous Wac knockout should be reported or discussed. Investigating Wac homozygous knockout embryos, if viable at early stages, could provide valuable insight into the developmental origins of the neuroanatomical and behavioral phenotypes described in the heterozygous animals. Even a brief histological or transcriptomic characterization of embryonic brains would strengthen the mechanistic understanding of Wac function during neurodevelopment.

      We agree and have collected embryos as early as embryonic day 12.5 from multiple litters but never detected a knockout. We have added this text to the animal methods sections so that readers understand that an effort was made to determine when death occurs. While we don’t currently explore this further in mice, we now include zebrafish waca; wacb double knockouts. Notably, while we were able to generate a few of these mutants, most died. However, some zebrafish were aged long enough to observe lethal deficits in heart formation and swim bladder development, suggesting that early loss of Wac could impact these critical organs and lead to death.

      (2) A better description of the data reported in Supplementary Tables 3 through 5 is needed. Supplementary Table 3 does not report any statistically significantly differentially expressed genes in the FDR column, and Supplementary Table 5 reports only two, and the reader should understand what the columns are indicating.

      We have now added figure legend text to the supplementary file to explain each Table mentioned here.

      Reviewer #2 (Recommendations for the authors):

      (1) Page 3, last paragraph. The description of wacb is confusing. I recommend that the authors provide the unshown data they mention and also further explanation of the breeding scheme and result. Indeed, if wacb is homozygous lethal, does that make it more like the mouse WAC gene, and thus potentially the more relevant paralogue to study? Are both waca and wacb expressed in the same tissues? How does that compare to mouse and human WAC expression? Such figures about gene expression (even when adapted with permission from public resources like Allen brain atlas or GTEX) are common in this sort of paper, as they can be helpful to understand when and where the gene is thought to act. For waca vs. wacb, they may help determine which gene is more relevant to the brain (for example, if only one is expressed in the brain).

      First, this is a great question and we have now added whole mount in situ for the waca and wacb genes as Figure S1. These data show low to no wacb expression in brain regions while waca is highly expressed there. Since the waca mutants showed phenotypes relevant to DESSH but wacb mutants did not, this correlates with observed expression patterns without fully excluding wacb from any role. Thus, we also made waca/wacb double KO zebrafish that showed a lethal phenotype, similar to homozygous mice mutants. Only a few waca; wacb double knockouts survived partway through development and are now shown in Figure S1. Since wacb KO zebrafish did not show any detectable phenotype on their own, we did not include the data since there are already several figures/tables in this manuscript. However, the waca KO zebrafish did show phenotypes similar to humans with DESSH and are the ones we focused on.

      (2) Why did the authors cross the mice into the outbred CD1 background? Usually, most labs keep the lines on an inbred background. Was there a particular rationale here? I am not saying that they could not outcross them. It is just a bit puzzling why. Perhaps a sentence of explanation in the methods section would be warranted.

      This is a great question and we have now added text to the animal methods section. Many labs that study development, especially on genes critical for survival/life like the Wac gene, use a more robust strain like CD-1. By doing this, we have a better chance of evaluating mutants at more mature ages and getting enough progeny to do more reproducible studies.

      (3) A typical first experiment in a new knockout (fish or mouse) is to establish that the deletion does indeed result in a loss of RNA and protein. In the absence of this, the rest of the paper cannot be as confidently interpreted.

      We did this for the mouse model and found reduced protein expression in the constitutive Het; however, these data are part of the western blots in Figure 5. We now mention in the early results section that protein levels were reduced in the Hets but maintain that the presentation of the western blot is better suited to Fig. 5, where it can be compared to the other western blots. For zebrafish this was attempted but was more difficult. Available antibodies don’t work in zebrafish. RNA expression was attempted in both models and due to Wac being a critical gene for life, there are checks in place to upregulate faulty and normal RNA in the waca model. We screened for frameshift mutations in multiple KO lines and confirmed them by genomic DNA sequencing. In making many KOs and large-scale mutagenesis in zebrafish, we usually depend on phenotype-genotype segregation in Mendelian inheritance for many generations.

      (4) Are these new lines indeed knockouts? I did find a WAC western as part of a later figure for the mouse. The authors may want to mention that earlier, or present at least that data right away. What about in the fish? Is there a way to confirm at the RNA or protein level that it is indeed a null allele?

      Yes, as mentioned in the above response we have now mentioned our Wac western blot results early when introducing the mouse mutants and the issues with doing this in fish are presented above as well.

      (5) Why are fish used that are KO while mice are Hets? Are WAC homozygous mice not viable? This should be mentioned. Regardless, the rationale for examining heterozygous mice and homozygous mutant fish should be provided. Each kind of experiment is useful, but they are interpreted in different ways. Hets will genocopy the patients, who are generally hets, while KOs are often useful for a study of the essential roles of the genes, even if they are not really modeling the patient gene dose.

      Wac homozygous mice in our hands are embryonic lethal, now mentioned in the animal methods section, but we found early on that the Hets mimic several human DESSH patients. In zebrafish it is more complicated. We analyzed waca and wacb hets in zebrafish but found no phenotypes. This could be in part due to some complementation between the waca and wacb genes. It is also possible that a full waca KO could resemble a human DESSH individual since wacb may complement somewhat, even though deleting wacb entirely does not have a measurable phenotype. We have added more text to the discussion to explore these complexities. We also made waca/wacb double KO (dKO) zebrafish, but they showed a lethal phenotype, similar to homozygous mice mutants, suggesting some complementation by the wacb gene even though its loss alone did not produce phenotypes.

      (6) Figure 3A: It does not appear that the authors are directly statistically comparing the two groups (genotypes) that they are drawing conclusions about. This is an unfortunately common mistake in the neuroscience literature across papers. There is a nice older review about it here: https://pubmed.ncbi.nlm.nih.gov/21878926/. To draw conclusions about the differences between the mouse genotypes, they need to compare the two genotypes directly with a statistical test. See Nygard et al. for a recommended approach, like comparing social preference indexes (https://onlinelibrary.wiley.com/doi/abs/10.1002/aur.2154).

      Thank you for this information. Previous reviewers at a different journal asked for this particular evaluation. We have now made changes to address the assessment, and graphs now reflect comparisons of genotypes instead of a single genotype between time with a mouse or object. We have also moved to using a social discrimination index to compare the genotypes, similar to the study mentioned.
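
      The direct genotype comparison discussed in this exchange can be sketched as follows. This is only an illustration, not the analysis used in the manuscript: the index formula (mouse time minus object time, divided by their sum), the function names, and all example values are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def discrimination_index(t_mouse, t_object):
    """Assumed index: (mouse - object) / (mouse + object), ranging from -1 to 1."""
    total = t_mouse + t_object
    if total == 0:
        raise ValueError("animal spent no time with either stimulus")
    return (t_mouse - t_object) / total


def welch_t(a, b):
    """Welch's t statistic comparing two independent groups directly."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical (mouse_seconds, object_seconds) pairs, one per animal.
wt_times = [(120, 60), (110, 70), (130, 50), (100, 80)]
het_times = [(90, 90), (80, 100), (95, 85), (100, 80)]

wt_index = [discrimination_index(m, o) for m, o in wt_times]
het_index = [discrimination_index(m, o) for m, o in het_times]

# One test between the two genotypes, instead of separate
# mouse-vs-object tests run within each genotype.
t_stat = welch_t(wt_index, het_index)
print(f"WT mean {mean(wt_index):.3f}, Het mean {mean(het_index):.3f}, t = {t_stat:.2f}")
```

      Reducing each animal to a single index is what lets the two genotypes be compared in one test; a p-value would additionally require the t distribution with Welch-corrected degrees of freedom.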

      (7) MRI - it is a bit weird to separate the male and female brains just for the MRI. Was there a premise from human data to do so? If not, the authors should probably pool them. If they are concerned there are sex effects (or, more likely, a sex by genotype interaction) I recommend that they use a two-factor ANOVA and simply put both sex and genotype into the model. This will also have the advantage of increasing their statistical power for genotype effects a bit. If their current results are robust, they will still show up as a significant sex x genotype interaction.

      All data in the manuscript initially compared the sexes to each other. We have now added this text to the animal section of the methods: for MRI, some zebrafish behaviors and now the RNA-seq data, sex differences were observed, and sex is therefore presented independently for these measurements. We now state that if no sex differences were observed the data were pooled.
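
      For readers unfamiliar with the two-factor approach the reviewer recommends, a minimal sketch of a balanced two-way ANOVA with a sex x genotype interaction is given below. The function name, the balanced-design restriction, and the example volumes are illustrative assumptions; none of the numbers come from the manuscript.

```python
from statistics import mean


def two_way_anova_balanced(cells):
    """F statistics for a balanced two-factor design.

    cells maps (genotype, sex) -> list of measurements. Every combination
    must be present and hold the same number of animals.
    """
    genotypes = sorted({g for g, _ in cells})
    sexes = sorted({s for _, s in cells})
    n = len(next(iter(cells.values())))
    if len(cells) != len(genotypes) * len(sexes):
        raise ValueError("every genotype/sex combination is required")
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("needs a balanced design with at least 2 animals per cell")

    grand = mean(x for v in cells.values() for x in v)
    g_mean = {g: mean(x for s in sexes for x in cells[(g, s)]) for g in genotypes}
    s_mean = {s: mean(x for g in genotypes for x in cells[(g, s)]) for s in sexes}

    # Sums of squares for the two main effects, the interaction, and the residual.
    ss_g = len(sexes) * n * sum((m - grand) ** 2 for m in g_mean.values())
    ss_s = len(genotypes) * n * sum((m - grand) ** 2 for m in s_mean.values())
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_int = ss_cells - ss_g - ss_s
    ss_err = sum((x - mean(v)) ** 2 for v in cells.values() for x in v)

    df_g, df_s = len(genotypes) - 1, len(sexes) - 1
    df_int = df_g * df_s
    df_err = len(genotypes) * len(sexes) * (n - 1)
    ms_err = ss_err / df_err
    return {
        "genotype": (ss_g / df_g) / ms_err,
        "sex": (ss_s / df_s) / ms_err,
        "genotype:sex": (ss_int / df_int) / ms_err,
        "df_error": df_err,
    }


# Hypothetical regional volumes (arbitrary units), four animals per cell.
volumes = {
    ("WT", "F"): [10.0, 10.2, 9.8, 10.0],
    ("Het", "F"): [10.1, 9.9, 10.0, 10.0],
    ("WT", "M"): [10.0, 10.1, 9.9, 10.0],
    ("Het", "M"): [10.6, 10.4, 10.5, 10.5],
}
result = two_way_anova_balanced(volumes)
print(result)
```

      All animals contribute to the single error term, which is where the gain in power over sex-stratified t-tests comes from. Converting the F statistics to p-values requires the F distribution, and an unbalanced design would need a regression-based fit instead.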

      (8) Also, did the authors correct for multiple testing in the MRI analysis? Since they are testing many regions, there is a risk of false positives if they do not. This could be confounded further by their splitting the data by sex, thus doubling the number of tests.

      As noted above, we did not apply multiple-testing corrections given the large number of regions and low number of replicates.
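
      The FDR correction the reviewer refers to is commonly the Benjamini-Hochberg procedure. A minimal sketch follows; the region names and p-values are hypothetical placeholders, not results from the manuscript.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-region p-values from uncorrected tests.
region_p = {
    "region_a": 0.001,
    "region_b": 0.012,
    "region_c": 0.040,
    "region_d": 0.200,
    "region_e": 0.640,
}
q_values = dict(zip(region_p, benjamini_hochberg(list(region_p.values()))))
significant = [name for name, q in q_values.items() if q < 0.05]
print(significant)
```

      In this toy example a region with an uncorrected p of 0.040 no longer passes at a 5% FDR, which is the kind of attrition the reviewer anticipates when many regions are tested.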

      (9) How many images per animal were analyzed for the cell counts? This detail is absent from the methods and would help with evaluating the robustness of these findings. What other approaches were used to make sure the counting was unbiased?

      We analyzed 3-4 images per animal for counts and counted hundreds of cells per image. In addition, the person counting was blinded to avoid any bias. These details have now been updated in the methods.

      (10) As with the MRI, for the DEG analysis, I recommend the authors simply put sex and genotype into the same model as two factors (with an interaction), to increase their sensitivity to genotype effects, as well as be able to report on robust genotype x sex differences, if there are any. They may also consider testing the model with and without excluding the three outlier animals on their PCA. It may be that the noise of those outliers is detracting from their sensitivity for DEGs somewhat.

      We greatly expanded our analyses and found more robust and unique changes in males that are now added to Figure 7 and supplemental files. After considering the data, we decided to highlight the sex differences separately.

      (11) A few more relatively simple things could readily be done with the RNAseq data to add some depth and interpretation. For example, do the hits here overlap other published IDD/autism DEG lists from mouse knockouts studies of genes like FoxP2, Chd8, Dnmt3a, Myt1l, Tcf4, etc? Do autism genes show up in the lists of hits here? And if so, more than expected by chance? Can they provide some visualization of their GO results in the main figure?

      When we looked into the sex differences more closely, we found that only the males showed significant upregulation of other autism risk genes, an increase that was previously unappreciated when the sexes were assessed together. Yes, several autism genes do show up, but this is heavily biased to males. Our main Figure 7 and new supplemental files show new GO term analyses and provide additional data looking not only at autism but also at other factors.
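
      The reviewer's question of whether autism genes appear "more than expected by chance" is typically answered with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation; the function name and every count in the example are hypothetical and are not taken from the manuscript.

```python
from math import comb


def hypergeom_enrichment_p(universe, list_size, hits_size, overlap):
    """P(X >= overlap) when hits_size genes are drawn from a universe
    that contains list_size genes from the reference list."""
    upper = min(list_size, hits_size)
    total = comb(universe, hits_size)
    tail = sum(
        comb(list_size, k) * comb(universe - list_size, hits_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: expressed genes, reference autism genes, DEGs, shared genes.
universe, autism_genes, degs, shared = 15000, 200, 300, 12
expected = degs * autism_genes / universe
p_value = hypergeom_enrichment_p(universe, autism_genes, degs, shared)
print(f"expected overlap {expected:.1f}, observed {shared}, p = {p_value:.2e}")
```

      The choice of universe matters: it should contain only genes that could have been detected as differentially expressed (for example, all genes passing the expression filter), otherwise the enrichment is overstated.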

      (12) It appears the IMPC has phenotyped this mouse somewhat, including craniofacial abnormalities. They also report on some blood cell differences. Anyway, if no one has written about that data yet (as it was generated in the context of a big consortium effort), their guidelines may allow you to include some of their data as Supplementary Figures here with proper attribution. It might help to at least summarize useful findings from there in your discussion.

      Due to the large number of figures/tables already in this report we don’t think this will be helpful. However, we do refer readers to the consortium in the animal methods section so they can explore data already generated by the IMPC.

      (13) Minor/Typos:

      (a) Figure 2K: I am confused by the description of three genotypes in the legend, but only two in the panel?

      Corrected.

      (b) I found it a little distracting that some results figures were embedded in the introduction.

      We have moved the figures further in the manuscript to start in the results section.

      (c) I don't understand this sentence: "Due to reduced sample size, sex-stratified DE was performed without model corrections at FDR < 0.1, 7 and found genes significantly upregulated and downregulated, respectively;" The sample size here seemed robust, so I am not sure what they were referring to? Are there missing numbers from this sentence? What is the 7? I think there are enough typos here that I am not sure how to evaluate this claim. Thus, the writing and clarity of this part could be improved.

      This section had several typos that have now been corrected.

      (d) "Marwan Shinawi, (unpublished results)" is a bit atypical of a citation. Are these results being reported with his permission? If so, then it should say 'personal communication' (if the journal permits this - some do not). If not, they should not report someone else's unpublished results without their explicit permission. It might upset some people to have their results presented this way.

      We have changed unpublished results to personal communication. Marwan Shinawi is an author on this manuscript and has approved of everything we have reported.

      (e) In all figures, consider shape or color coding for sex, even when pooling the data (e.g, the data points in the behavior figures).

      This is a good idea, but since we found no difference when analyzing the data we don’t see how this extra work will make a difference. Since we now mention in the methods that sex differences were only presented as separate graphs when observed, we think this should be acceptable.

    1. eLife Assessment

      This important study reports that an oncogenic population in an epithelium can either be repressed or spread, depending on the tissues. This work provides convincing evidence, supported by pharmacological perturbations and numerical simulations using the vertex model, that the principle of "high heterotypic interfacial tension" that appears to drive cell sorting and tissue segregation in embryonic models similarly applies to cancer cell behaviour.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      The behaviour of cells expressing constitutively active HRas is examined in mosaic monolayers, both in MCF10a breast epithelial and Beas2b bronchial epithelial cell lines, mimicking the potential initial phase of development of carcinoma. Single HRas-positive cells are excluded from MCF10a but not Beas2b monolayers. Most interestingly, however, when in groups, these cells are not excluded, but rather sharply segregated within a MCF10a monolayer. In contrast, they freely mix with wt Beas2b cells. Biophysical analysis identifies high tension at heterotypic interfaces between HRas and wild-type cells as the likely reason for segregation of MCF10a cells. The hypothesis is supported experimentally, as myosin inhibition abolishes segregation. The probable reason for lack of segregation in the bronchial epithelium is to be found in the different intrinsic properties of these cells, which form a looser tissue with lower basal actomyosin activity. The behaviour of single cells and groups is recapitulated in a vertex model based on the principle of differential interfacial tension, under the condition of high heterotypic interfacial tension.

      Strengths:

      Despite being long recognized as a crucial event during cancer development, segregation of oncogenic cells has been a largely understudied question. This nice work addresses the mechanics of this phenomenon through a straightforward experimental design, applying the biophysical analytical approaches established in the field of morphogenesis. Comparison between two cell types provides some preliminary clues on the diversity of effects in various cancers.

    3. Reviewer #2 (Public review):

      Summary:

      The authors investigate the behavior of oncogenic cells in mammary and bronchial epithelia. They observe that individual oncogenic cells are preferentially excluded from the mammary epithelium, but they remain integrated in the bronchial epithelium. They also observe that clusters of oncogenic cells form a compact cluster in mammary epithelium, but they disperse in the bronchial epithelium. The authors demonstrate experimentally and in the vertex model simulations that the difference in observed behavior is due to the differential tension between the mutant and wild-type cells due to a differential expression of actin and myosin.

      Strengths:

      (1) Very detailed analysis of experiments to systematically characterize and quantify differences between mammary and bronchial epithelia.

      (2) Detailed comparison between the experiments and vertex model simulations to identify the differential cell line tension between the oncogenic and wild-type cells as one of the key parameters that are responsible for the different behavior of oncogenic cells in mammary and bronchial epithelia.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The behaviour of cells expressing constitutively active HRas is examined in mosaic monolayers, both in MCF10a breast epithelial and Beas2b bronchial epithelial cell lines, mimicking the potential initial phase of development of carcinoma. Single HRas-positive cells are excluded from MCF10a but not Beas2b monolayers. Most interestingly, however, when in groups, these cells are not excluded, but rather sharply segregated within a MCF10a monolayer. In contrast, they freely mix with wt Beas2b cells. Biophysical analysis identifies high tension at heterotypic interfaces between HRas and wild-type cells as the likely reason for segregation of MCF10a cells. The hypothesis is supported experimentally, as myosin inhibition abolishes segregation. The probable reason for lack of segregation in the bronchial epithelium is to be found in the different intrinsic properties of these cells, which form a looser tissue with lower basal actomyosin activity. The behaviour of single cells and groups is recapitulated in a vertex model based on the principle of differential interfacial tension, under the condition of high heterotypic interfacial tension.

      Strengths:

      Despite being long recognized as a crucial event during cancer development, segregation of oncogenic cells has been a largely understudied question. This nice work addresses the mechanics of this phenomenon through a straightforward experimental design, applying the biophysical analytical approaches established in the field of morphogenesis. Comparison between two cell types provides some preliminary clues on the diversity of effects in various cancers.

      Weaknesses:

      Although not calling into question the main message of this study, there are a few issues that one may want to address:

      (1) One may be careful in interpreting the comparison between MCF10a and Beas2b cells as used in this study. The conditions may not necessarily be representative of the actual properties of breast and bronchial epithelia. How much of the epithelial organization is reconstituted under these experimental conditions remains to be established. This is particularly obvious for bronchial cells, which would need quite specific culture conditions to build a proper bronchial layer. In this study, they seemed to be on the verge of a mesenchymal phenotype (large gaps, huge protrusions, cells growing on top of each other, as mentioned in the manuscript).

      As an alternative to Beas2b, comparison of MCF10a with another cell line capable of more robust in vitro epithelial organization, but ideally with different adhesive and/or tensile properties, would be highly interesting, as it may narrow down the parameters involved in segregation of oncogenic cells.

      (2) While the seminal description of tissue properties based on interfacial tensions (Brodland 2002) is clearly key to interpreting these data, the actual "Differential Interfacial Tension Hypothesis" poses that segregation results from global differences, i.e., juxtaposition of two tissues displaying different intrinsic tensions. On the contrary, the results of the present work support a different scenario, where what counts is the actual difference in tension ALONG the tissue boundary, in other words, that segregation is driven by high HETEROTYPIC interfacial tension. This is an important distinction that should be clarified.

      (3) Related: The fact that actomyosin accumulates at the heterotypic interface is key here. It would be quite informative to better document the pattern of this accumulation, which is not clear enough from the images of the current manuscript: Are we talking about the actual interface between mutant and wt cells (membrane/cortex of heterotypic contacts)? Or is it more globally overactivated in the whole cell layer along the border? Some better images and some quantification would help.

      (4) In the case of Beas2b cells, mutant cells show higher actin than wt cells, while actin is, on the contrary, lower in mutant MCF10a cells (Figure 2b). Has this been taken into account in the model? It may be in line with the idea that HRas may have a different action on the two cell types, a possibility that would certainly be worth considering and discussing.

      Comments on revisions:

      There is still one last point that should be made even clearer:

      The system is being modelled based on the principle of INTERFACIAL TENSION, a description pioneered by the works of Steinberg and of Harris, and nicely conceptualized by Brodland (2002). Now the observed behaviour is a perfect case of sorting based on higher interfacial tension AT the boundary between cell types (with nice additional documentation of local actin and myosin enrichment in the revised manuscript). What needs to be made crystal clear is that this is NOT equivalent to the model of DITH ("DIFFERENTIAL INTERFACIAL TENSION HYPOTHESIS") (Brodland 2002, Krieg et al 2008). It is important to stop using DITH in this context, as it leads to confusion and misinterpretations. Indeed, DITH predicts cell/tissue sorting based on differences in interfacial tension WITHIN the two cell types. While DITH accounts for relative POSITIONING (one tissue engulfing the other), it is now established that this is not the motor for cell sorting and tissue segregation; the key parameter is the tension at the heterotypic interface. I thus invite the authors to avoid the terms "differential"/DITH, and rather to use either "interfacial tension" or, more specifically, "HIGH HETEROTYPIC INTERFACIAL TENSION".

      Related: the authors correctly cite Canty et al NatComm2017 when discussing this phenomenon. I suggest adding a further key supporting reference: "D.M. Sussman, J.M. Schwarz, M.C. Marchetti, M.L. Manning, Soft yet sharp interfaces in a vertex model of confluent tissue, Phys. Rev. Letters 120 (2018) 058001". One may also include another pioneering work, in Drosophila: "M. Aliee, J.C. Roper, K.P. Landsberg, C. Pentzold, T.J. Widmann, F. Julicher, C. Dahmann, Physical mechanisms shaping the Drosophila dorsoventral compartment boundary, Curr. Biol. 22 (2012) 967-976."

      We thank the reviewer for this important clarification. We fully agree that the mechanism underlying the observed segregation in our system is best described in terms of elevated heterotypic interfacial tension, rather than the classical Differential Interfacial Tension Hypothesis (DITH). As the reviewer correctly points out, DITH in its original formulation refers to differences in intrinsic interfacial tensions within each cell population, which primarily governs relative positioning (e.g., tissue engulfment), rather than the local sorting dynamics we observe here.

      In contrast, our experimental and modeling results support a scenario in which segregation is driven by increased tension specifically at heterotypic interfaces between HRasV12 and wild-type cells. We agree that continued use of the term “Differential interfacial tension” in this context may lead to conceptual ambiguity.

      Accordingly, we have revised the manuscript throughout to replace references to “differential interfacial tension” with more precise terminology, namely “interfacial tension” or “heterotypic interfacial tension”, wherever appropriate. We have also updated the Discussion to explicitly clarify this distinction and its implications for interpreting our results.

      We thank the reviewer for suggesting additional relevant literature, which we have now included.

      Reviewer #2 (Public review):

      Summary:

      The authors investigate the behavior of oncogenic cells in mammary and bronchial epithelia. They observe that individual oncogenic cells are preferentially excluded from the mammary epithelium, but they remain integrated in the bronchial epithelium. They also observe that clusters of oncogenic cells form a compact cluster in mammary epithelium, but they disperse in the bronchial epithelium. The authors demonstrate experimentally and in the vertex model simulations that the difference in observed behavior is due to the differential tension between the mutant and wild-type cells due to a differential expression of actin and myosin.

      Strengths:

      Very detailed analysis of experiments to systematically characterize and quantify differences between mammary and bronchial epithelia

      Detailed comparison between the experiments and vertex model simulations to identify the differential cell line tension between the oncogenic and wild-type cells as one of the key parameters that are responsible for the different behavior of oncogenic cells in mammary and bronchial epithelia

      Weaknesses:

      It is unclear what is the mechanistic origin of the shape-tension coupling, which is used in the vertex model, and how important that coupling is for the presented results. Authors claim that the shape-tension coupling is due to the anisotropic distribution of stress fibers when cells are under external stress. It is unclear why the stress fibers should affect an effective line tension on the cell boundaries and why the stress fibers should be sensitive to the magnitude of the internal isotropic cell pressure. In experiments, it makes sense that stress fibers form when cells are stretched. Similar stress fibers form when cytoskeleton or polymer networks are stretched. It is unclear why the stress fibers should be sensitive to the magnitude of internal isotropic cell pressure. If all the surrounding cells have the same internal pressure, then the cell would not be significantly deformed due to that pressure and stress fibers would not form. Authors should better justify the use of the shape-tension coupling in the model, since most of the observed behavior is already captured by the differential tension even if there is no shape-tension coupling.

      We thank the reviewer for this comment. We agree that we did not provide a mechanistic origin for the shape-tension coupling. In our model, stress fiber formation, along with actin ring formation, indicated that cells at the interface were elongated. Hence, we hypothesised that an interfacial force could induce nematic alignment at the interface. However, such an activity would only be feasible if the interface interaction were sufficiently high. Thus, the isotropic pressure at the heterotypic interface served as a proxy for cell-cell interactions in our model. However, inspired by recent work [1], we have tested whether activation of cells at the interface by shear stress would produce similar results. Exploring this aspect will require additional simulations.

      (1) Pérez-Verdugo, F., Maniou, E., Galea, G. L., & Banerjee, S. (2026). Mechanosensitive feedback organizes cell shape and motion during hindbrain neuropore morphogenesis. Current Biology.

      The observed difference of shape indices between the interfacial and bulk cells in simulations in the absence of differential line tension is concerning. This suggests that either there are not enough statistics from the simulations or that something is wrong with the simulations. For all presented simulation results, the authors should repeat multiple simulations and then present both averages and standard deviations. This way it would be easier to determine whether the observed differences in simulations are statistically significant.

      The observed differences in shape indices between interfacial and bulk cells in simulations in the zero-line-tension case (Lambda=0) remain non-zero at the zero-stress threshold because the interface cells are still subject to the shape-dependent contribution gamma_ij, since the current model treats gamma_ij as independent of Lambda. We are exploring the possible relationship between Lambda and gamma_ij, and we will update this in the next version of the manuscript.
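
      To make the requested comparison concrete, here is a minimal sketch of how a shape index could be computed per cell and summarized across repeated simulations. It assumes the common vertex-model definition, perimeter divided by the square root of area; the manuscript's exact definition is not quoted in the review, and the function names and the replicate values below are hypothetical placeholders, not data from the study.

```python
import math
from statistics import mean, stdev


def shape_index(vertices):
    """Perimeter / sqrt(area) of a simple polygon given as ordered (x, y) vertices.

    Assumption: this is the usual vertex-model shape index; the manuscript's
    own definition may differ.
    """
    n = len(vertices)
    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        perimeter += math.hypot(x1 - x0, y1 - y0)
        twice_area += x0 * y1 - x1 * y0  # shoelace formula
    area = abs(twice_area) / 2.0
    return perimeter / math.sqrt(area)


def summarize(replica_values):
    """Mean and sample standard deviation over independent simulation runs."""
    return mean(replica_values), stdev(replica_values)


# Regular hexagon with unit side length, a standard reference shape.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]
print(f"hexagon shape index: {shape_index(hexagon):.3f}")

# Hypothetical per-replica mean shape indices (illustrative numbers only).
interface_runs = [3.95, 4.01, 3.98]
bulk_runs = [3.90, 3.92, 3.88]
for label, runs in (("interface", interface_runs), ("bulk", bulk_runs)):
    m, s = summarize(runs)
    print(f"{label}: {m:.3f} +/- {s:.3f} (n={len(runs)})")
```

      Reporting the mean with its standard deviation for interface and bulk cells side by side makes it easier to judge whether a residual difference at Lambda=0 exceeds run-to-run scatter.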

      Recommendations for the authors:

      The editor recommends considering the new comment made by reviewer #1 in his/her report:

      "There is still one last point that should be made even more clear:

      The system is being modelled based on the principle of INTERFACIAL TENSION, a description pioneered by the works of Steinberg and of Harris, and nicely conceptualized by Brodland (2002). Now the observed behaviour is a perfect case of sorting based on higher interfacial tension AT the boundary between cell types (with nice additional documentation of local actin and myosin enrichment in the revised manuscript). What needs to be made crystal clear is that this is NOT equivalent to the model of DITH ("DIFFERENTIAL INTERFACIAL TENSION HYPOTHESIS") (Brodland 2002, Krieg et al 2008). It is important to stop using DITH in this context, as it leads to confusion and misinterpretations. Indeed, DITH predicts cell/tissue sorting based on differences in interfacial tension WITHIN the two cell types. While DITH accounts for relative POSITIONING (one tissue engulfing the other), it is now established that this is not the motor for cell sorting and tissue segregation; the key parameter is the tension at the heterotypic interface. I thus invite the authors to avoid the terms "differential"/DITH, and rather to use either "interfacial tension" or, more specifically, "HIGH HETEROTYPIC INTERFACIAL TENSION".

      Related: the authors correctly cite Canty et al NatComm2017 when discussing this phenomenon. I suggest adding a further key supporting reference: "D.M. Sussman, J.M. Schwarz, M.C. Marchetti, M.L. Manning, Soft yet sharp interfaces in a vertex model of confluent tissue, Phys. Rev. Letters 120 (2018) 058001". One may also include another pioneering work, in Drosophila: "M. Aliee, J.C. Roper, K.P. Landsberg, C. Pentzold, T.J. Widmann, F. Julicher, C. Dahmann, Physical mechanisms shaping the Drosophila dorsoventral compartment boundary, Curr. Biol. 22 (2012) 967-976."

      Please see our response to Reviewer #1.

      Reviewer #2 (Recommendations for the authors):

      The authors have improved the manuscript and addressed some of my concerns. However, some of the questions were not adequately addressed.

      (1) I appreciate additional justification regarding the need for the shape-tension coupling in the vertex model. However, the authors have not answered my question regarding why the shape-tension coupling model should be sensitive to the magnitude of the internal isotropic cell pressure. In experiments, it makes sense that stress fibers form when cells are stretched, but it is unclear why the stress fibers should be sensitive to the magnitude of internal isotropic cell pressure. If all the surrounding cells have the same internal pressure, then the cell would not be significantly deformed due to that pressure, and stress fibers would not form.

      We thank the reviewer for pointing this out. We agree that we did not provide a mechanistic origin for the shape-tension coupling. In our model, stress fiber formation, along with actin ring formation, indicated that cells at the interface were elongated. Hence, we hypothesized that an interfacial force could induce nematic alignment at the interface. However, such an activity would only be feasible if the interface interaction were sufficiently high. Thus, the isotropic pressure at the heterotypic interface served as a proxy for cell-cell interactions in our model.

      However, inspired by recent work [1], we have tested whether activation of cells at the interface by shear stress would produce similar results. Exploring this aspect will require additional simulations.

      (1) Pérez-Verdugo, F., Maniou, E., Galea, G. L., & Banerjee, S. (2026). Mechanosensitive feedback organizes cell shape and motion during hindbrain neuropore morphogenesis. Current Biology.

      (2) I appreciate that the authors provided additional statistics related to simulations. I am still very concerned about the observed difference in the shape indices between the cells at the interface and the bulk, when the interfacial line tension is exactly zero (Lambda=0). In that case, the cells at the interface and in the bulk are identical, and there should be no difference in the shape indices. Are cells at the interface for the zero-line-tension case (Lambda=0) still subject to the shape-dependent contribution gamma_ij? If that contribution is still included for the cells at the interface, then this could explain why cells at the interface are still different from cells in the bulk even when Lambda=0.

      The observed differences in shape indices between interfacial and bulk cells in simulations in the zero-line-tension case (Lambda=0) remain non-zero at the zero-stress threshold because the interface cells are still subject to the shape-dependent contribution gamma_ij, since the current model treats gamma_ij as independent of Lambda. We are exploring the possible relationship between Lambda and gamma_ij, and we will update this in the next version of the manuscript.

      (3) Authors included several additional supplemental figures (Figs. S4, S5, S6, S7), but they are not discussed in the manuscript text. These new supplemental figures were only discussed in the rebuttal letter. These figures should also be discussed in the manuscript text.

      We have cited the new supplementary figures in the main text.

      (4) Authors have answered in the rebuttal letter what experimental data was used in Fig. 4c. This information also needs to be provided in the manuscript text.

      We have added this information to the caption of Figure 4.

      (5) Supplementary Figure 3 is missing. That figure got moved to the appendix.

      This has been rectified in the Supplementary file and the citations have been updated accordingly in the main text.

      (6) At the end of section 4 in the main text, the authors introduced a new sentence regarding simulations of the vertex model with interfacial tension and mechanochemical feedback. The details of that model are described in the appendix, but it would be helpful to add a sentence or two already in the main text describing the mechanism of the mechanochemical feedback.

      We have added a line describing the mechanism of mechanochemical feedback.

      (7) In the definition of the eccentricity, 'a' should be the minor axis and 'b' the major axis, i.e., 'a' and 'b' should be swapped.

      We have corrected this.
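
      Why the axis labels matter can be seen in a short sketch. It assumes the standard ellipse eccentricity, e = sqrt(1 - (a/b)^2), with 'a' the minor and 'b' the major axis as the reviewer states; the manuscript's formula itself is not quoted here, so the expression and the function name are assumptions.

```python
import math


def eccentricity(minor_axis, major_axis):
    """Eccentricity of an ellipse, assuming e = sqrt(1 - (a/b)^2), a = minor, b = major.

    With the labels swapped (a > b) the argument of the square root becomes
    negative, which is why the definition needs a <= b.
    """
    if minor_axis <= 0 or major_axis <= 0:
        raise ValueError("axis lengths must be positive")
    if minor_axis > major_axis:
        raise ValueError("minor axis exceeds major axis; were 'a' and 'b' swapped?")
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)


print(eccentricity(1.0, 1.0))            # circle
print(f"{eccentricity(3.0, 5.0):.3f}")   # elongated cell
```

      A circle gives zero and increasingly elongated cells approach one, so the quantity is only well defined when the shorter axis sits in the numerator.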

      (8) There is a typo at the end of the vertex model description in the methods section. "The details of the shape-tension coupling is described in the interface." The word "interface" should be "appendix".

      We have fixed the typo.

      (9) In the appendix section describing the shape-tension coupling, the authors should explain how the cell's director n is defined.

      We have added a line in the appendix section describing shape-tension coupling explaining how the cell’s director n is defined.

      (10) In Appendix Fig. 1, the two angles are defined as theta and theta' but the figure caption is defining angles theta_1 and theta_2. These angles need to be consistent.

      This has been fixed.

    1. eLife Assessment

      This study presents valuable findings on phase-separated condensate formation by the MUT-16 protein, which plays a key role in small RNA biogenesis. A detailed analysis of the interactions governing condensate formation was carried out using coarse-grained and all-atom molecular dynamics simulations, complemented by in vitro phase separation experiments. While many of the results appear solid, a number of technical details are lacking, the computational part appears incomplete and would benefit from additional analyses and clarifications, and the novelty of the study should also be clarified, particularly in comparison with the authors' previous work on MUT-16. Overall, the work will be of interest to biophysicists and molecular biologists studying phase separation and biomolecular condensates.

    2. Reviewer #1 (Public review):

      In this work, Gaurav et al. present an extensive study of phase-separated condensates formed by the foci-forming region (FFR) of the MUT-16 protein. The authors first report in vitro experiments showing that these condensates exhibit upper critical solution temperature (UCST) behavior. They then provide a detailed analysis based on atomistic simulations of MUT-16 FFR condensates, identifying key interactions responsible for LLPS, including salt bridges, cation-π interactions, and the role of Na⁺ ions.

      Overall, the manuscript is well written. However, there are several concerns that should be addressed.

      Major Concerns:

      (1) I have several questions regarding the system preparation that require clarification. The authors state that "65 copies of the coarse-grained MUT-16 FFR were embedded in a slab-shaped simulation," but it is not clear how this initial configuration was generated. Were the molecules randomly distributed in the simulation box, or were they initially arranged in a preformed condensate? Alternatively, were they randomly inserted and allowed to self-assemble into a condensate during NpT simulations?

      In Figure 1, the atomistic snapshot appears to show a well-defined condensate at the center of the simulation box. It would be important to clarify how this configuration was obtained: Was it generated from coarse-grained simulations starting from random initial conditions? Or was a preassembled condensate used as input?

      Related to this, how do the authors ensure that the simulations are equilibrated? While 20 μs appears to be a reasonably long simulation time for coarse-grained simulations, it would be useful to demonstrate equilibration explicitly. For example, the authors could plot the center-of-mass positions (in the long axis of the simulation box) of individual proteins over time to show that all molecules reach a steady state and remain within the condensate without systematic drift.
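
      One way to quantify the drift check suggested above is to fit a straight line to each chain's center-of-mass coordinate over the later part of the trajectory. The sketch below assumes the coordinates along the long box axis have already been extracted and unwrapped across periodic boundaries; the function name, the discard fraction, and the synthetic traces are hypothetical.

```python
import numpy as np


def com_drift(times_us, com_z_nm, discard_fraction=0.5):
    """Per-chain slope (nm/us) of center-of-mass position versus time.

    times_us : shape (n_frames,)
    com_z_nm : shape (n_frames, n_chains), unwrapped COM along the long axis
    discard_fraction : initial part of the run treated as equilibration
                       (an arbitrary choice for this sketch)
    """
    times = np.asarray(times_us, dtype=float)
    z = np.asarray(com_z_nm, dtype=float)
    start = int(len(times) * discard_fraction)
    # np.polyfit accepts a 2D y and fits every column independently;
    # row 0 of the result holds the linear coefficients (slopes).
    return np.polyfit(times[start:], z[start:], deg=1)[0]


# Synthetic traces: chain 0 fluctuates in place, chain 1 drifts steadily.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 20.0, 201)
traces = np.column_stack([
    rng.normal(0.0, 0.1, t.size),
    0.5 * t + rng.normal(0.0, 0.1, t.size),
])
print(com_drift(t, traces))
```

      Slopes that scatter around zero for all chains would support the claim of a steady state, whereas a systematic non-zero slope flags molecules that are still entering or leaving the condensate.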

      (2) The authors experimentally observe UCST behavior for these condensates. Do the coarse-grained or atomistic simulations reproduce this behavior?

      While atomistic simulations may be too computationally demanding to systematically explore temperature dependence, coarse-grained simulations could be used to test whether condensates are stable at lower temperatures and dissolve at higher temperatures. Such an analysis would provide valuable support for the experimental observations.

      (3) Regarding the analysis of ions, several points could be clarified and extended:

      a) It would be helpful to report the total number of ions and quantify how many are located inside vs. outside the condensate. While qualitative trends can be inferred from density profiles, quantitative analysis would strengthen the conclusions.

      b) It would also be interesting to analyze the number of contact ion pairs (e.g., Na⁺-Cl⁻ pairs), as described in J. Chem. Phys. 156, 044505 (2022). It is known that some ion models tend to overestimate ion pairing and underestimate solubility (e.g., J. Chem. Phys. 153, 010903 (2020)).

      c) In this context, the use of scaled-charge models has been shown to improve the description of ionic solutions and biomolecular systems (e.g., J. Phys. Chem. Lett. 2019, 10, 23, 7531-7536). I would suggest that, at least for one trajectory, the authors perform a test simulation using scaled charges (e.g., scaling by ~0.8) to evaluate whether ion distributions and protein-ion interactions are significantly affected.

      d) Finally, while the selected water model is known to be accurate, it would be useful to assess its performance for concentrated salt solutions. For example, the authors could estimate the density of a 6 m salt solution and compare it with experimental data or validated models (e.g., J. Chem. Phys. 151, 134504 (2019)). This would help clarify to what extent the conclusions depend on the chosen force field.
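
      For point (a), the inside/outside count can be sketched as follows. The example assumes the condensate occupies a fixed interval along the long axis that does not cross a periodic boundary, with bounds chosen by the analyst (for instance from the protein density profile); the function name, the bounds, and the coordinates are hypothetical.

```python
import numpy as np


def ion_partition(ion_z_nm, slab_lo_nm, slab_hi_nm):
    """Count ions inside and outside the condensate slab in every frame.

    ion_z_nm : shape (n_frames, n_ions), ion coordinate along the long axis
    slab_lo_nm, slab_hi_nm : assumed condensate boundaries along that axis
    Returns (n_inside, n_outside), each of shape (n_frames,).
    """
    z = np.asarray(ion_z_nm, dtype=float)
    inside = (z >= slab_lo_nm) & (z <= slab_hi_nm)
    n_inside = inside.sum(axis=-1)
    n_outside = z.shape[-1] - n_inside
    return n_inside, n_outside


# Two illustrative frames with three ions each; slab spans 3 to 7 nm.
z_example = np.array([[1.0, 5.0, 9.0],
                      [4.0, 5.0, 6.0]])
n_in, n_out = ion_partition(z_example, 3.0, 7.0)
print("inside per frame: ", n_in)
print("outside per frame:", n_out)
print("mean fraction inside:", (n_in / z_example.shape[1]).mean())
```

      Averaging these counts over the trajectory, separately for Na⁺ and Cl⁻, would give the quantitative partitioning that the density profiles only show qualitatively.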

      Minor Concerns

      (1) In the Introduction, it would be helpful to elaborate further on the possible driving forces of LLPS in this region. Are there prior hypotheses or evidence pointing to specific interactions (e.g., cation-π, π-π, electrostatic interactions)? While this work addresses these questions, a brief discussion of previous experimental or theoretical insights would provide useful context.

      (2) On page 18, the authors state: "MUT-16 FFR satisfies the length (172 residues), aromatic content (20.35%), and Arg enrichment (85.71%) criteria. Its charge content (10.47%) and charge balance (38.89% positive charge fraction) are slightly below the nominal thresholds." It would be very helpful to include a schematic representation of the protein sequence highlighting these features (aromatic residues, charge distribution, etc.) in the corresponding figure, to provide a more intuitive understanding.

      (3) A question regarding ion hydration: What is the coordination environment of the ions that bridge proteins? Are they still hydrated by water molecules, or does the reduced water content inside the condensate significantly affect their solvation? Typically, Na⁺ and Cl⁻ ions have coordination numbers around 5-6 in aqueous solution. Do protein interactions and reduced solvent conditions within the condensate alter this coordination? A brief analysis or discussion would be valuable.
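
      The composition metrics quoted in minor concern (2) can be reproduced with a few lines, which may help when preparing the suggested schematic. The definitions below are assumptions: aromatic = F/W/Y, positive = K/R, negative = D/E, charge balance = positive residues over all charged residues, and Arg enrichment = Arg over all positive residues. The sequence is a synthetic placeholder built from residue counts that are consistent with the quoted percentages for a 172-residue chain; it is not the MUT-16 FFR sequence.

```python
AROMATIC = frozenset("FWY")  # assumption: His not counted as aromatic
POSITIVE = frozenset("KR")   # assumption: His not counted as charged
NEGATIVE = frozenset("DE")


def composition_metrics(sequence):
    """Length and percentage composition metrics under the assumed definitions."""
    seq = sequence.upper()
    n_aromatic = sum(res in AROMATIC for res in seq)
    n_pos = sum(res in POSITIVE for res in seq)
    n_neg = sum(res in NEGATIVE for res in seq)
    n_arg = seq.count("R")
    n_charged = n_pos + n_neg
    return {
        "length": len(seq),
        "aromatic_pct": 100.0 * n_aromatic / len(seq),
        "charge_pct": 100.0 * n_charged / len(seq),
        "positive_fraction_pct": 100.0 * n_pos / n_charged if n_charged else 0.0,
        "arg_enrichment_pct": 100.0 * n_arg / n_pos if n_pos else 0.0,
    }


# Synthetic placeholder: 35 aromatic, 6 Arg, 1 Lys, 11 acidic, 119 Gly filler.
toy_sequence = "Y" * 35 + "R" * 6 + "K" + "D" * 6 + "E" * 5 + "G" * 119
for name, value in composition_metrics(toy_sequence).items():
    print(f"{name}: {value:.2f}")
```

      Applying the same function to the real sequence, with whatever residue classes the authors actually used, would make the thresholds in the text directly checkable.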

    3. Reviewer #2 (Public review):

      Summary:

      Gaurav et al. investigate residue-level interactions within the MUT-16 FFR condensate using all-atom molecular dynamics simulations. The authors first argue, based on sequence analysis, that MUT-16 FFR is more representative than the widely studied FUS LCD. They then characterize the UCST phase behavior of MUT-16 FFR experimentally, followed by a detailed analysis of residue-level contact frequencies and lifetimes. In addition, the manuscript examines ion-residue interactions and water-mediated interactions. Overall, this work provides a comprehensive view of the dynamic interactions within the MUT-16 FFR condensate.

      Strengths:

      Large-scale all-atom molecular dynamics simulations have been performed to investigate dynamical interactions within condensates. The analysis is comprehensive and rigorous, and the claims are strongly justified by the data.

      Weaknesses:

      The large amount of detail in the results section sometimes makes it difficult to identify the central take-home messages. I encourage the authors to more clearly highlight the principal findings and the physical insights that may generalize to other condensate-forming systems. The authors may also consider streamlining parts of the Results section to improve focus and readability.

    4. Reviewer #3 (Public review):

      Summary:

      The authors aim to characterize the molecular interaction network inside phase-separated condensates formed by the MUT-16 foci-forming region (FFR), using atomistic simulations combined with residue-resolved analyses of contact frequencies, contact lifetimes, specific non-covalent interactions, ions, and water.

      Strengths:

      The work addresses an interesting and biologically relevant system, and the combination of large-scale atomistic simulations with an extensive contact analysis has clear potential value for the broader condensate field.

      Weaknesses:

      In its current form, several technical issues need to be addressed before the main conclusions can be considered robust. Most importantly, the simulated sequence is 172 residues long, while the atomistic slab has box dimensions of only 12 nm in two directions. This length scale is comparable to the expected end-to-end distances of a disordered 172-residue chain. It is therefore not clear whether individual protein chains interact with their own periodic images, which could substantially affect overall chain dynamics and subsequently bias contact lifetimes, residue-residue interaction statistics, and the inferred condensate dynamics. The authors should check, for each chain, histograms of end-to-end distances. For chains for which more than ~2-3% of the end-to-end distances exceed ~11 nm, the authors should explicitly check for self-image interactions (for example, using "gmx mindist -pi") and report whether such interactions occur and for what fraction of the trajectory. Without this control, at least in the Supporting Information, I do not think the simulation-derived contact dynamics are sufficiently trustworthy.
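
      The screening step described above could be sketched as follows. It assumes per-frame end-to-end distances have already been computed for every chain; the 11 nm cutoff and the 2% tolerance are taken from the lower end of the range given in this review, and the function name and synthetic data are hypothetical.

```python
import numpy as np


def chains_needing_image_check(end_to_end_nm, cutoff_nm=11.0, max_fraction=0.02):
    """Flag chains whose end-to-end distance often approaches the box width.

    end_to_end_nm : shape (n_frames, n_chains)
    Returns (fraction_above, flagged): the per-chain fraction of frames with
    distance > cutoff_nm, and the indices of chains exceeding max_fraction.
    """
    d = np.asarray(end_to_end_nm, dtype=float)
    fraction_above = (d > cutoff_nm).mean(axis=0)
    flagged = np.flatnonzero(fraction_above > max_fraction)
    return fraction_above, flagged


# Synthetic example: chain 0 stays compact, chain 1 is extended in 5% of frames.
distances = np.full((100, 2), 5.0)
distances[:5, 1] = 12.0
fractions, flagged = chains_needing_image_check(distances)
print("fraction above cutoff:", fractions)
print("chains to inspect:", flagged)
```

      Only the flagged chains would then need the explicit periodic-image distance check, which keeps the control inexpensive even for ten replicas.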

      A second major concern is the treatment of ions. The manuscript makes important conclusions about Na⁺ association and Na⁺-mediated bridging, but the atomistic ion model is not explicitly stated. This is a reproducibility problem and also affects interpretation - for example, standard Amber ions are known to bind too strongly to the oppositely charged residues. In their results, one acidic residue appears to interact on average with roughly two Na⁺ ions, which is not obviously expected from charge balance alone. The authors should state the exact Na⁺/Cl⁻ parameters used, justify their compatibility with TIP4P-D and the protein force field, and explicitly interpret why such a strong Na⁺ association with acidic residues is observed.

      More generally, because the manuscript is centered on contact lifetimes, the choice of the atomistic force field needs stronger justification. Salt bridges, cation-pi contacts, pi-pi stacking, ion coordination, and water-mediated interactions are all force-field-sensitive. Since there is no direct experimental observable used here to validate the simulations, the authors should discuss the expected limitations of the chosen force field (while I do acknowledge that testing different force fields would be computationally too demanding).

      I also find the sequence-comparison section somewhat confusing. The authors compare one specific IDR, MUT-16 FFR, with the average properties of human IDRs and then frame it as more representative than FUS LCD. It is not clear how informative this is because IDR behavior depends strongly on sequence-specific patterning, molecular connectivity, and the particular interaction network of each protein. Averages over human IDRs may provide a broad context, but they do not necessarily define what is physically or biologically representative for phase separation. In addition, FUS LCD is not intended to be a representative human IDR; it is an unusually low-complexity, phase-separating domain. Therefore, the "more representative than FUS" framing should be toned down. At most, this analysis shows that MUT-16 FFR is compositionally less extreme than FUS LCD.

      The ion- and water-bridging analyses are also potentially overinterpreted. A distance-based simultaneous contact with two residues does not by itself establish functional mediation or regulation of condensate dynamics. The authors should either add appropriate controls, such as local-density-normalized baselines or randomized-contact expectations, or soften the language to describe these as geometrically defined co-contact events rather than mechanistic bridging interactions.

      Finally, the independence of the atomistic replicas is unclear. The manuscript should state whether all ten all-atom simulations were initiated from the same coarse-grained condensate configuration or from distinct CG frames. If the starting structures came from one CG trajectory, the authors should report how far apart those frames were in simulation time and provide evidence that the initial atomistic configurations are structurally independent. If only velocities differ, the simulations should not be described as fully independent structural replicas.

    1. eLife Assessment

      Overall, this is a manuscript with solid evidence that delivers an important community resource for those performing experimental research in amyotrophic lateral sclerosis. The authors address the lack of validated tools for the detection and quantification of proteins associated with amyotrophic lateral sclerosis (ALS) through an extensive screening of 303 commercially available antibodies to 33 protein targets. The effort invested in generating the knockout lines for validation experiments is a clear strength of the study.

    2. Reviewer #1 (Public review):

      Summary:

      The authors address the lack of validated tools for the detection and quantification of proteins associated with amyotrophic lateral sclerosis (ALS) through an extensive screening of 303 commercially available antibodies to 33 protein targets. Their ALS-Reproducible Antibody Platform (ALS-RAP) delivers a validated antibody toolbox for ALS research, which will provide an advantageous starting point for researchers in this field. Ayoubi R. et al. showcase the characterization workflow, presenting as an example the characterization of antibodies targeting Galectin-1, encoded by the LGALS1 gene. A selection of these antibodies was also used to profile protein levels across human induced pluripotent stem cell (iPSC)-derived and primary neurological cell types, and the findings support that the ALS disease mechanism involves both neuronal and glial cells.

      Strengths:

      The knockout (KO)-based approach is definitely the major strength of this study, providing a high level of confidence in the data collected in human induced pluripotent stem cell (iPSC)-derived and primary neurological cell types. The focus on renewable reagents (monoclonal and recombinant antibodies) is also important. The extensive characterization of this set of antibodies will benefit any scientist interested in any of the 33 target proteins, even in fields other than neuroscience.

      The authors perform an interesting protein profiling study assessing 27 proteins, comparing RNA and protein expression data, and using two independent WB preparations of the same cell types.

      The conclusions that can be drawn from this first assessment might not be final, but the data are compelling because they have been collected with reliable and validated antibodies.

      Another strength of this work is the data dissemination strategy, which includes the Only Good Antibodies (OGA) platform, where YCharOS data are curated and presented in an easy and intuitive manner that facilitates antibody selection by the end user for WB, IP and IF applications.

      Weaknesses:

      The authors mention the development of single-chain variable fragment (scFv) recombinant antibodies raised by the SGC against the six proteins (ANXA11, OPTN, MATR3, PFN1, UBQLN2 and VCP) for which few renewable antibodies are commercially available. The development was optimized to generate antibodies particularly suitable for IP, and the clone selection process was carried out using IP coupled to mass spectrometry. Even though the generation of these novel reagents is not the focus of this work, the authors do not provide any data on this aspect.

      The protein profiling study is limited to WB data, and the authors did not provide any explanation of why there was no integration with IP and IF data, not even for those targets that have validated antibodies. Also, not all the cell types have been screened by chemiluminescence-based detection and by fluorescence-based WB, and the authors do not elaborate on the reason for such a choice.

    3. Reviewer #2 (Public review):

      Overall, this is a solid manuscript that delivers an important community resource. The execution is relatively simple, but the value is real, the work is rigorously performed, and the open dissemination through Zenodo, the F1000Research YCharOS Gateway and OGA is well executed. The effort invested in generating the knockout lines for validation experiments is a clear strength of the study. I have a number of comments that I think would strengthen the resource and the conclusions drawn from it.

      Below, I list specific points.

      (1) The rationale for the selection of these 33 genes is insufficient. The authors lean on the Nijs & Van Damme classification and on PubMed entry counts, but the number of PubMed entries is not a meaningful criterion for what constitutes an important ALS protein - some of the most disease-relevant genes are precisely those with fewer publications, while heavily cited genes such as CAV1 carry weak ALS-specific evidence. The authors should provide a more transparent and biologically motivated rationale for inclusion and exclusion (ClinGen evidence tier, replicated GWAS signals, large meta-analyses, ALSoD) and explain why specific risk genes outside this list were not part of ALS-RAP.

      (2) "107 of 231 (46%) demonstrated specific target staining in IF." The criteria used to define "specific target staining" at the IF level are not stated. From the Galectin-1 example, the mosaic WT/KO strategy provides a binary readout, but for proteins with low expression, weak punctate staining or unusual subcellular distributions, a single threshold is unlikely to capture specificity uniformly across 231 antibodies.

      (3) Several claims in the manuscript depend on differential protein abundance across cell types. As presented, these claims are supported by qualitative Western blot images only. They should be substantiated by quantification across multiple biological replicates.

      (4) This manuscript represents a unique opportunity to address antibody recognition of splicing variants, which is of considerable value to the community. For each target, the predicted isoforms in Ensembl could be cross-referenced against the observed bands, and the pattern of bands compared across cell types could be informative about which isoforms each antibody captures. This would convert ambiguous "extra bands" into useful biological information and would substantially increase the value of the resource. I strongly encourage the authors to include this analysis.

      (5) The iPSC-derived microglia receive a comprehensive QC panel (IBA1/PU.1 IF, CD45/CD11b flow, qRT-PCR for nine canonical markers; Figure S4), which allows the reader to assess culture purity. The other iPSC-derived lineages - motor neurons, dopaminergic neurons, oligodendrocytes and astrocytes - are validated by a single marker each in WB (Figure S3) without purity quantification. Given that several conclusions of the manuscript rest on the cell-type-specific detection of ALS-associated proteins, equivalent quality control should be performed for the other lineages so that the reader can evaluate the purity of each preparation.

      (6) The robustness of the resource would be substantially increased by validating at least a subset of the targets in a second iPSC background, in at least some of the cell types analysed.

      (7) The newly developed SGC scFv antibodies are arguably the most novel reagent contribution of this manuscript, yet they receive a single sentence in the body of the paper. A more thorough description is warranted.

      (8) Accessibility of the resource through Zenodo is not straightforward - the reader currently has to navigate to individual antibody characterization reports one by one to extract recommendations for a given target. While the use of an established public repository is important for permanence, a dedicated ALS-RAP website with an interactive, searchable interface - filterable by target, application, host species and clonality - would meaningfully improve uptake. The relationship between such a portal and the existing OGA platform should also be clarified.

    1. eLife Assessment

      Non-essential amino acids such as glutamine are known to be required for general T cell activation by sustaining basic biosynthetic processes, including nucleotide biosynthesis, ATP generation, and protein synthesis. In this important study, the authors found that extracellular asparagine (Asn) is required not only for T cells to fuel metabolic reprogramming, but also to produce helper T cell lineage-specific cytokines, for instance IL-17. In particular, the importance of Asn in IL-17 production was convincingly demonstrated in the mouse experimental autoimmune encephalomyelitis (EAE) model, which mimics human multiple sclerosis.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors reveal that the availability of extracellular asparagine (Asn) represents a metabolic vulnerability for the activation and differentiation of naive CD4+ T cells. To deplete extracellular Asn, they employed two orthogonal approaches: activating naive CD4+ T cells in either PEGylated asparaginase (PEG-AsnASE)-treated medium or custom-formulated RPMI medium specifically lacking Asn. Importantly, they demonstrate that Asn depletion not only impaired metabolic reprogramming associated with CD4+ T cell activation but also reduced CD4+ helper T cell lineage-specific cytokine production, thereby ameliorating the severity of experimental autoimmune encephalomyelitis.

      The experiments presented here are comprehensive and well-designed, providing compelling evidence for the conclusions. The conclusions will be important to the field.

      Comments on revised version:

      The authors have sufficiently addressed my previous comments. The manuscript represents an excellent contribution to the field.

    3. Reviewer #2 (Public review):

      While the importance of asparagine in the differentiation and activation of CD8 T cells has been previously reported, its role in CD4 T cells remained unclear. Using culture media containing specific amino acids, the authors demonstrated that extracellular asparagine promotes CD4 T cell proliferation. Consistent with this, depletion of extracellular asparagine using PEG-AsnASE suppressed CD4 T cell activation. Proteomic analysis focusing on asparagine content revealed that, during the early phase of T cell activation, most asparagine incorporated into proteins is derived from extracellular sources. The authors further confirmed the importance of extracellular asparagine in vivo, demonstrating improved EAE pathology.

      While the data are well organized and convincing, the mechanism by which asparagine deficiency leads to altered T cell differentiation remains unclear. It is also necessary to investigate the transporters involved in asparagine uptake. In particular, elucidating whether different T cell subsets utilize the same or distinct transport mechanisms would provide important insight into the immunoregulatory role of asparagine.

      Comments on revised version:

      The authors have addressed the previous concerns, and the manuscript has been significantly improved.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors reveal that the availability of extracellular asparagine (Asn) represents a metabolic vulnerability for the activation and differentiation of naive CD4+ T cells. To deplete extracellular Asn, they employed two orthogonal approaches: activating naive CD4+ T cells in either PEGylated asparaginase (PEG-AsnASE)-treated medium or custom-formulated RPMI medium specifically lacking Asn. Importantly, they demonstrate that depletion not only impaired metabolic reprogramming associated with CD4+ T cell activation but also reduced CD4+ helper T cell lineage-specific cytokine production, thereby ameliorating the severity of experimental autoimmune encephalomyelitis.

      Strengths:

      The experiments presented here are comprehensive and well-designed, providing compelling evidence for the conclusions. The conclusions will be important to the field.

      We thank the reviewer for their assessment of our work and enthusiasm towards our findings.

      Weaknesses:

      (1) EAE is the prototypic T cell-mediated autoimmune disease model, and both Th1 and Th17 cells are implicated in its pathogenesis. In contrast, Th2 and Treg cells and their associated cytokines (such as IL-4 and IL-10) have been shown to play a role in the resolution of EAE, and potentially in the modulation of disease progression. Thus, it will be important to determine whether Asn depletion affects the differentiation of naive CD4+ T cells into corresponding subsets under Th2 and Treg polarization conditions, as well as the expression of lineage-specific transcription factors and cytokine production.

      We appreciate that the reviewer recognizes the functional relevance of our findings showing that Asn is important for proper Th17 differentiation and promotion of EAE (Figure 5 E-J, Figure 6). Given that multiple CD4+ T cell subsets play a role in both the initiation and resolution of EAE, we agree that it would be valuable to further support these findings with complementary Th2 and Treg differentiation experiments.

      To address this, we examined the effects of asparagine depletion during in vitro iTreg and Th2 differentiation. We found that the frequencies of FOXP3+ iTreg and GATA3+ Th2 cells were reduced when cultures were grown in asparagine-deficient media. These results have been added to Supplementary Figure 5.

      (2) EAE is characterized by inflammation and demyelination in the central nervous system (CNS), leading to neurological deficits. Myelin destruction is directly correlated with the severity of the disease. For Figure 6, did the authors perform spinal cord histological analysis by hematoxylin and eosin (H&E) or Luxol fast blue (LFB) staining? This is important to rigorously examine pathological EAE symptoms.

      We agree with the reviewer that histopathology including H&E and/or LFB staining is a useful indicator of EAE disease severity. However, we are no longer able to obtain PEG-AsnASE (Oncaspar) to perform these studies.

      Reviewer #2 (Public review):

      While the importance of asparagine in the differentiation and activation of CD8+ T cells has been previously reported, its role in CD4+ T cells remained unclear. Using culture media containing specific amino acids, the authors demonstrated that extracellular asparagine promotes CD4+ T cell proliferation. Consistent with this, depletion of extracellular asparagine using PEG-AsnASE suppressed CD4+ T cell activation. Proteomic analysis focusing on asparagine content revealed that, during the early phase of T cell activation, most asparagine incorporated into proteins is derived from extracellular sources. The authors further confirmed the importance of extracellular asparagine in vivo, demonstrating improved EAE pathology.

      While the data are well organized and convincing, the mechanism by which asparagine deficiency leads to altered T cell differentiation remains unclear. It is also necessary to investigate the transporters involved in asparagine uptake. In particular, elucidating whether different T cell subsets utilize the same or distinct transport mechanisms would provide important insight into the immunoregulatory role of asparagine.

      (1) The finding that asparagine supplementation promotes T cell proliferation under various amino acid conditions is highly significant. However, the concentration at which this effect occurs remains unclear. A titration analysis would be necessary to determine the dose dependency of asparagine.

      Our studies indicate that the concentration of asparagine present in conventional RPMI lymphocyte media is sufficient to support CD4+ T cell activation and proliferation in vitro (Figure 1, Supplementary Figure 1 & Figure 2). This concentration was consistently used throughout our studies. In line with the reviewer’s comments, however, we have not yet determined the dose dependency of Asn during CD4+ T cell activation.

      To address this, we performed a titration experiment in which asparagine was supplemented at varying concentrations in DMEM and Asn-deficient RPMI. Activation markers were measured 24 hours after TCR stimulation under these culture conditions. We found that the critical asparagine concentration lies between 3.78 and 37.8 µM. This concentration range is consistent with the physiological concentration of asparagine in murine plasma, which is approximately 50 µM (PMID: 24842860; PMID: 23853755). These data have been added to Supplementary Figure 1.

      (2) The effects of asparagine deficiency occur during the early phase of T cell activation. Thus, it is likely that the transporters responsible for asparagine uptake are either rapidly induced upon activation or already expressed in the resting state. Since this is central to the focus of the manuscript, it is interesting to identify the transporter responsible for asparagine uptake during early T cell activation. A recent paper (DOI: 10.1126/sciadv.ads350) reported that macrophages utilize Slc6a14 to use extracellular asparagine. Is this also true for CD4+ T cells?

      While a comprehensive characterization of the amino acid transporter network is certainly of interest, it is beyond the scope of the present study. As the reviewer notes, others have explored asparagine transport in lymphocytes. For example, Wu et al. (PMID: 33420490) determined that the asparagine transporter Slc1a5 is significantly upregulated in CD8+ T cells upon activation, based on qRT-PCR measurements comparing mRNA from naïve and activated CD8+ T cells. They further validated the functional role of Asn transporters in CD8+ T cells by measuring N15-labeled asparagine uptake in the presence of siRNAs targeting the asparagine transporters Slc1a5 or Slc38a2 and found that inhibition of either transporter significantly reduced intracellular N15-Asn accumulation.

      To gain additional insight into Asn transporters in distinct CD4+ T cell subsets, we reanalyzed a published RNA-seq dataset (Thakore et al., 2024; PMID: 39009838). We quantified the expression of transporters Slc1a5, Slc38a2, and Slc6a14 in naïve and activated CD4+ T cells polarized under Th1, npTh17, or pTh17 conditions at various time points. We observed that Slc1a5 expression increased upon activation in all subsets. Similarly, Slc38a2 expression increased during the early activation stage, but subsequently returned to basal levels similar to those of naïve cells. In contrast, Slc6a14 showed relatively low basal expression in naïve cells compared to the other transporters investigated, and its expression decreased over the differentiation period in all CD4+ T cell subsets examined. These results indicate that Asn transporters Slc1a5 and Slc38a2 are expressed in CD4+ T cells during early activation and differentiation. These data have been included in Supplementary Figure 3.

      (3) Given that depletion of extracellular asparagine impairs differentiation of Th1 and Th17 cells, it is possible that TCR signaling is compromised under these conditions. This point should be investigated by targeting downstream signaling molecules such as Lck, ZAP70, or mTOR. Also, does it affect the protein stability of master transcription factors such as Tbet and RORgt?

      We agree with the reviewer that asparagine deprivation could impact several aspects of T cell function. In our study, we demonstrate that asparagine is crucial for CD4+ T cell protein synthesis and the expression of activation markers (Figure 1B-K, Figure 2K-L, and Figure 3A-C). We also highlight its importance in promoting CD4+ T cell subset differentiation and lineage-defining cytokine production (Figure 5B-J). Other studies have reported a role for asparagine in early activation marker expression in CD8+ T cells and in enhancing LCK function (PMID: 33822775; PMID: 33420490). Given its proposed function as a promoter of LCK signaling in CD8+ T cells, it will be important to determine in future studies whether a similar mechanism operates during CD4+ T cell activation.

      We appreciate the reviewer’s inquiry regarding the stability of critical transcription factors defining Th1 and Th17 subsets. We have examined the expression of the transcription factors RORγT and Tbet in Th17 and Th1 polarized cells and observed reduced expression in the absence of asparagine. We have included these findings in Supplementary Figure 5.

      (4) Is extracellular asparagine also important for the differentiation of helper T cell subsets other than Th1 and Th17, such as Th2, Th9, and iTreg?

      Please see our response to Reviewer 1 regarding iTreg and Th2. Investigation of Th9 cells is beyond the scope of the present study.

      (5) Asparagine taken up from outside the cell has been shown to be used for de novo protein synthesis (Figure 3E), but are there any proteins that are particularly susceptible to asparagine deficiency? This can be verified by performing proteome analysis, and the effects on Th1/17 subset differentiation mentioned above should also be examined.

      The investigation of specific proteins that exhibit asparagine dependency would indeed be interesting. Given our results showing that global protein synthesis is blunted with asparagine deprivation (Figure 3A-C), it would be particularly compelling to identify proteins with a specific requirement for asparagine. However, this level of analysis is beyond the scope of our study.

      (6) While the importance of extracellular asparagine is emphasized, Asns expression is markedly induced during early T cell activation. Nevertheless, the majority of asparagine incorporated into proteins appears to be derived from extracellular sources. Does genetic deletion of Asns have any impact on early CD4+ T cell activation? The authors indicated that newly synthesized Asns have little impact on CD8+ T cells in the Discussion section, but is this also true for CD4+ T cells? This could be verified through experiments using CRISPR-mediated Asns gene targeting or pharmacological inhibition.

      We appreciate the reviewer’s consideration of the contribution of endogenous asparagine to CD4+ T cell function. However, genetic perturbation of Asns is beyond the scope of our study, which is specifically focused on defining the requirements for extracellular asparagine and its role in CD4+ T cell activation.

    1. eLife Assessment

      This important study demonstrates that nutrient stress engenders metabolic vulnerabilities in pancreatic ductal adenocarcinoma (PDAC). By combining cell line and mouse models, the authors provide compelling evidence showing that arginine depletion from the microenvironment disrupts lipid homeostasis in PDAC resulting in ferroptosis upon exposure of tumors to polyunsaturated fatty acids. This report is likely to be of broad interest to researchers interested in studying cancer biology, metabolic adaptations and stress responses.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors set out to define how arginine availability regulates lipid metabolism and to explore the implications of this relationship in pancreatic ductal adenocarcinoma (PDAC), a tumor type known to exist in an arginine-poor microenvironment. Using a combination of rigorous genetic and metabolomic approaches, they uncover a previously underappreciated role for arginine in maintaining lipid homeostasis. Importantly, they demonstrate that arginine deprivation sensitizes PDAC cells to ferroptosis through lipidome perturbations, which can be exploited therapeutically via co-treatment with aESA and ferroptosis inducers (FINs). These findings have meaningful implications for the field. They not only shed light on the metabolic vulnerabilities created by nutrient restriction in PDAC, but also suggest a practical avenue for combination therapies that exploit ferroptosis sensitivity. This is particularly relevant in the context of pancreatic cancer, which is notoriously resistant to conventional treatments. The methods employed are broadly applicable to other nutrient-stress contexts and may inspire similar investigations in other solid tumor types.

      Strengths:

      One of the major strengths of the study is the use of complementary and well-controlled approaches (including metabolomic profiling, genetic perturbations, and in vivo models) to support the central hypothesis. The experiments are thoughtfully designed and clearly presented, and the conclusions are, for the most part, well supported by the data. The findings provide mechanistic insight into nutrient-lipid crosstalk and identify a potential therapeutic strategy for targeting arginine-deprived tumors.

      Comments on revised version:

      The authors have substantially strengthened the revised manuscript and have addressed my prior concerns, and the evidence supports the central conclusions. This work provides meaningful insight into how nutrient limitation in the tumor microenvironment creates metabolic liabilities that may be therapeutically exploited, and it should be of interest to investigators studying cancer metabolism, pancreatic cancer, lipid biology, and ferroptosis.

    3. Reviewer #2 (Public review):

      This study by Jonker et al. examines how the metabolic adaptations to the microenvironment by pancreatic ductal adenocarcinomas (PDAC) present vulnerabilities that could be used for therapeutic purposes. The evidence supporting the claims of the authors is mostly solid, and the multiplicity of models used, as well as the combination of in vitro and in vivo work, are appreciated, but some conclusions would benefit from additional substantiation. This work would be of interest to biologists working on the impact of microenvironment and metabolism in cancer, and especially those investigating pancreatic cancer.

      In this study, the authors use mostly "doublings per day" as an indicator of cell death, notably for Figures 4 to 6. However, proliferative arrest (or a decrease in the proliferative rate) is not necessarily synonymous with cell death. It might be nice to complement these experiments with a true measure of cell death (e.g., PI uptake).

    4. Reviewer #3 (Public review):

      This important study investigates the impact of nutrient stress in the tumor microenvironment (TME), focusing on lipid metabolism in pancreatic ductal adenocarcinoma (PDAC). Understanding TME composition is crucial, as it highlights cancer vulnerabilities independent of intracellular mutations, particularly because PDAC tumors are often exposed to limited nutrient availability due to reduced perfusion.

      By utilizing a medium that mimics the nutrient conditions of PDAC tumors, the authors convincingly show that TME nutrient stress suppresses SREBP1, leading to reduced lipid synthesis, with low arginine levels identified as a key driver of this suppression. Importantly, mice with arginine-starved pancreatic tumors respond to a polyunsaturated fatty acid-rich diet. This discovery uncovers a synthetic lethal interaction in the tumor microenvironment that could be leveraged through dietary interventions.

      Comments on revised version:

      The authors have satisfactorily resolved all previously raised concerns through the inclusion of additional data and clarifications in the discussion.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this study, the authors set out to define how arginine availability regulates lipid metabolism and to explore the implications of this relationship in pancreatic ductal adenocarcinoma (PDAC), a tumor type known to exist in an arginine-poor microenvironment. Using a combination of rigorous genetic and metabolomic approaches, they uncover a previously underappreciated role for arginine in maintaining lipid homeostasis. Importantly, they demonstrate that arginine deprivation sensitizes PDAC cells to ferroptosis through lipidome perturbations, which can be exploited therapeutically via co-treatment with aESA and ferroptosis inducers (FINs). These findings have meaningful implications for the field. They not only shed light on the metabolic vulnerabilities created by nutrient restriction in PDAC, but also suggest a practical avenue for combination therapies that exploit ferroptosis sensitivity. This is particularly relevant in the context of pancreatic cancer, which is notoriously resistant to conventional treatments. The methods employed are broadly applicable to other nutrient-stress contexts and may inspire similar investigations in other solid tumor types.

      Strengths:

      One of the major strengths of the study is the use of complementary and well-controlled approaches (including metabolomic profiling, genetic perturbations, and in vivo models) to support the central hypothesis. The experiments are thoughtfully designed and clearly presented, and the conclusions are, for the most part, well supported by the data. The findings provide mechanistic insight into nutrient-lipid crosstalk and identify a potential therapeutic strategy for targeting arginine-deprived tumors.

      We thank the reviewer for their positive assessment of our manuscript.

      Weaknesses:

      A key weakness of the study lies in the mechanistic connection between arginine levels and SREBP1 activation. While the authors show that arginine restriction leads to reduced SREBP1 expression, the magnitude of this effect appears modest relative to the substantial changes observed in the lipidome. The study would benefit from a deeper analysis of SREBP1 regulation, particularly whether nuclear translocation or activation is affected. This could be addressed by examining the nuclear pool of SREBP1, using either subcellular fractionation or improved immunofluorescence imaging in both cell lines and tissue samples.

      We thank the reviewer for this comment and in our revised manuscript have undertaken several new studies to assess how the nuclear pool of SREBP1 is regulated by arginine starvation. We further identified one mechanism by which arginine starvation suppresses SREBP1 protein levels, namely GCN activation. We believe these additional studies strengthen the manuscript and appreciate the reviewer suggesting these studies.

      Another area where additional context would strengthen the manuscript is in the transcriptomic profiling of PDAC cells cultured in a tumor interstitial fluid mimic (TIFM). While the study emphasizes lipid-related pathways, highlighting the most significantly upregulated and downregulated pathways in Figure 1B would give readers a broader perspective on how arginine restriction reprograms the PDAC transcriptome. For instance, because polyamines are downstream of arginine and are known to influence lipid metabolism, it would be worth discussing whether these metabolites contribute to the phenotypes observed. Similarly, an evaluation of whether Dgat1/2 expression is altered could help delineate the full scope of lipid metabolic rewiring.

      We thank the reviewer for suggesting this change to our manuscript. We now provide a much more extensive presentation of our transcriptomic analyses in Figure 1 – Figure supplement 1, which we think will make our manuscript more useful to readers.

      Finally, it is worth noting that the KPC mouse model used in this study is based on conditional deletion of p53, which leads to faster-growing tumors and a distinct tumor microenvironment compared to models harboring the p53^R172H point mutation. Including a brief discussion of this distinction would help readers contextualize the translational relevance of the findings.

      We have revised the manuscript to include a discussion of this point.

      Reviewer #2 (Public review):

      This study by Jonker et al. examines how the metabolic adaptations to the microenvironment by pancreatic ductal adenocarcinomas (PDAC) present vulnerabilities that could be used for therapeutic purposes. The evidence supporting the claims of the authors is mostly solid, and the multiplicity of models used, as well as the combination of in vitro and in vivo work, are appreciated, but some conclusions would benefit from additional substantiation. This work would be of interest to biologists working on the impact of microenvironment and metabolism in cancer, and especially those investigating pancreatic cancer.

      We thank the reviewer for their positive assessment of our manuscript.

      In this study, the authors use mostly "doublings per day" as an indicator of cell death, notably for Figures 4 to 6. However, proliferative arrest (or a decrease in the proliferative rate) is not necessarily synonymous with cell death. It might be nice to complement these experiments with a true measure of cell death (e.g., PI uptake).

      We thank the reviewer for this important comment and have performed extensive additional experiments to measure cell death directly via viability markers, in addition to our indirect measurements of cell number at the start and end of experiments. We believe these additions strengthen our claims that PUFAs cause arginine-starved PDAC cells to undergo ferroptotic cell death.
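
      The distinction the reviewer draws can be illustrated with a minimal sketch. It assumes the common definition of doublings per day, log2(N_end / N_start) / days, computed from cell counts at the start and end of an experiment; the manuscript's exact formula is not given in this exchange, and the function name and all cell counts below are hypothetical.

```python
import math


def doublings_per_day(n_start: float, n_end: float, days: float) -> float:
    """Net population doublings per day from start and end cell counts.

    Assumed definition: log2(n_end / n_start) / days.
    """
    if n_start <= 0 or n_end <= 0 or days <= 0:
        raise ValueError("cell counts and duration must be positive")
    return math.log2(n_end / n_start) / days


# Hypothetical counts over a 2-day experiment (illustration only, not study data).
growth = doublings_per_day(100_000, 400_000, 2)  # net expansion
arrest = doublings_per_day(100_000, 100_000, 2)  # no net change in cell number
loss = doublings_per_day(100_000, 25_000, 2)     # fewer cells than at the start

print(growth, arrest, loss)
```

      Under this definition, a value near zero reflects only the net change in cell number: it cannot separate proliferative arrest from proliferation balanced by death. Only a negative value implies net cell loss, which is why a direct viability readout is a useful complement.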

      The composition of Tumor Interstitial Fluid Medium (TIFM) was published previously, but nonetheless a reminder of the composition of this medium in a Supplemental file of this study might be helpful. In particular, at the start of the Results section, the nature of serum/lipids in the different media should be specifically noted, especially given that the subsequent focus of the work is on lipids/SREBP. It is known that differences in the extracellular availability of lipids can profoundly alter de novo lipid biosynthesis pathways.

      We thank the reviewer for this comment. We have edited the text to provide additional context on the composition of TIFM, especially lipid availability. We further have provided a supplemental file with the composition of TIFM. We hope this will make the manuscript more useful and readily interpretable for readers.

      Reviewer #3 (Public review):

      This important study investigates the impact of nutrient stress in the tumor microenvironment (TME), focusing on lipid metabolism in pancreatic ductal adenocarcinoma (PDAC).

      Understanding TME composition is crucial, as it highlights cancer vulnerabilities independent of intracellular mutations, particularly because PDAC tumors are often exposed to limited nutrient availability due to reduced perfusion.

      By utilizing a medium that mimics the nutrient conditions of PDAC tumors, the authors convincingly show that TME nutrient stress suppresses SREBP1, leading to reduced lipid synthesis, with low arginine levels identified as a key driver of this suppression. Importantly, mice with arginine-starved pancreatic tumors respond to a polyunsaturated fatty acid-rich diet. This discovery uncovers a synthetic lethal interaction in the tumor microenvironment that could be leveraged through dietary interventions.

      The conclusions of this paper are mostly well supported by data; however, below are some aspects that could be further clarified.

      We thank the reviewer for their positive assessment of our manuscript.

      This study uses PDAC cells from the LSL-Kras G12D/+ ; Trp53 ; Pdx-1-Cre PDAC model. The authors convincingly demonstrate that the cell-extrinsic stimuli of low arginine availability suppress lipid synthesis and thus exert a dominant effect over the cell-intrinsic oncogenic Ras mutation, which is known to enhance fatty acid synthesis. Could the effect of low arginine on lipid synthesis be specific for certain mutations in PDAC? It would be interesting to investigate or discuss whether different mutations show the same SREBP1 reduction caused by low arginine levels, and whether these low SREBP1 levels can be ameliorated by arginine re-supplementation. Here, Jonker et al. show that human PDAC cells cultured in TIFM have reduced SREBP1 levels (Figure 1 - Figure supplement 1C). It would be further supportive of their conclusions if the authors could show that arginine re-supplementation is sufficient to restore SREBP1 levels in human PDAC cells.

      We thank the reviewer for this comment. In response, we have now shown that arginine supplementation increases SREBP1 levels and fatty acid synthesis in human PDAC cells (Figure 2 – Figure supplement 2). Further, we have also updated the manuscript to discuss that using the LSL-Kras G12D/+; Trp53; Pdx-1-Cre PDAC model limits our ability to assess how genetic differences influence the response to arginine starvation. We additionally discuss the genetic diversity of the human PDAC cell lines used in these studies, which do include different oncogenic mutations. We believe that these results provide some evidence that the findings we have made regarding arginine deprivation and SREBP in our genetically defined murine PDAC cell line are applicable to human PDAC cells with more diverse oncogenic lesions.

      The authors demonstrate that mPDAC cells cultured in RPMI and subsequently implanted into an orthotopic mouse model exhibit reduced expression of SREBP target genes when compared to in vitro cultured mPDAC-RPMI cells. This finding is in line with the observation that culturing PDAC cells in TIFM downregulates SREBP target genes compared to PDAC cells cultured in RPMI. However, caution is needed when directly comparing mPDAC-RPMI cultured cells to those in the orthotopic model, as the latter may include non-tumor cells and additional factors that could confound the results. The authors should explicitly acknowledge this limitation in their study.

      We thank the reviewer for this important caveat and we have revised the text to address this point. Importantly, we note that for all comparisons between in vitro and in vivo cultures, we carefully sort malignant cancer cells from orthotopic tumors prior to analysis. We believe this approach mitigates the impact of stromal contamination on our analyses.

      The in vivo evidence demonstrating that PUFA-rich tung oil reduces tumor size is compelling. However, the specific in vitro findings regarding its impact on doubling rates per day, particularly in the context of arginine-dependent PUFA supplementation, require further explanation. To enhance the robustness of their data and conclusions, the authors could consider conducting additional cell viability and proliferation assays. Moreover, it would be valuable to assess whether the observed effects on doubling rates per day remain significant after normalizing the data to the initial doubling time prior to PUFA supplementation. This is in particular important regarding the statement that "Addition of arginine significantly decreases sensitivity to a-ESA" as these cells already start with a higher doubling rate prior to a-ESA treatment.

      We thank the reviewer for this important comment and have performed additional experiments to measure cell death directly via viability markers in addition to our indirect measurements of cell number at the start and end of experiments. Furthermore, to address the issue of different rates of cell growth in cultures affecting the response to perturbations, we also used growth rate-corrected metrics (PMID: 27135972) to ensure that the effects of perturbations on cell growth and viability are not confounded by the baseline proliferative kinetics of the cells under various media conditions. We believe these additions strengthen our claims that arginine starvation sensitizes PDAC cells to PUFAs.

      Overall, this paper presents a compelling study that significantly enhances our understanding of the PDAC tumor microenvironment and its complex interactions with the tumor lipid metabolism.

      We again thank the reviewer for their positive assessment of our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In this study, the authors employ rigorous genetic and biochemical (metabolomic) approaches to uncover a previously unappreciated role for arginine in regulating lipid homeostasis. They further demonstrate the relevance of this pathway in pancreatic tumors, a solid tumor type often characterized by limited access to extracellular arginine. The authors present compelling evidence that arginine deprivation creates a metabolic liability, rendering tumors more susceptible to lipidome perturbations. This vulnerability can be therapeutically exploited through co-treatment with aESA and FIN to induce ferroptosis. Overall, the conclusions are convincing, the manuscript is well-written, and the figures are clearly presented.

      We again thank the reviewer for their positive assessment of our manuscript.

      The key weakness of the study lies in the mechanistic link between arginine levels and SREBP1 expression. While the data support the authors' argument, the observed changes in SREBP1 expression following arginine restriction appear modest relative to the more pronounced changes in the lipidome. To strengthen this connection, the authors may consider performing cellular fractionation to focus their analysis on the nuclear (active) pool of SREBP1. Improved immunofluorescence imaging and quantification of nuclear SREBP1 levels in tissues would also provide additional support for their model.

      We thank the reviewers for this helpful comment. To strengthen this study, we both examined the nuclear levels of SREBP1 in TIFM cultured cells and worked to identify the mechanistic link connecting arginine levels to SREBP1 expression.

      First, we found that arginine starvation does not lead to nuclear exclusion of SREBP1. We believe this finding strengthens our conclusion that arginine starvation regulates SREBP1 at the level of protein expression. We do agree with the reviewer that the change in SREBP1 protein level is modest, but we do show the effects of arginine on PDAC cell lipid metabolism are SREBP1 dependent (Figure 3O-P, Figure 5F, Figure 5 – Figure supplement 2D). Thus, we interpret these data as indicating that even the relatively modest change in SREBP1 protein levels is sufficient to cause large changes in the output of this transcription factor and the cellular lipidome.

      Second, we determined if the arginine-responsive GCN2 signaling pathway, which is known to regulate SREBP1, could contribute to the suppression of SREBP1 observed in PDAC cells. We found that GCN2 signaling is activated in PDAC cells in TIFM culture by arginine starvation and is active in animal tumors. We further found that activation of GCN2 is in part responsible for suppression of SREBP1, which is consistent with prior literature describing a role for GCN2 activation in suppressing SREBP1 translation (PMID: 17276353). Thus, while other mechanisms are at play in transducing arginine starvation to reduced SREBP1 protein levels, we have identified one mechanism (activation of GCN2) by which arginine starvation suppresses SREBP1, leading to the lipidomic changes we observed upon starvation of this amino acid.

      In addition, it would be helpful for the authors to highlight the most significantly upregulated and downregulated pathways in Figure 1B to give a more comprehensive view of transcriptomic changes in PDAC cells cultured under TIFM conditions. For example, since polyamines are downstream of arginine and known to regulate lipid metabolism, could some of the observed effects be attributed to changes in polyamine levels? Similarly, do arginine levels affect the expression of Dgat1 or Dgat2?

      We have added an additional Figure supplement to Figure 1 that includes a comprehensive list of up- and downregulated gene sets in PDAC cells cultured in TIFM via GSEA analysis. We also added additional KEGG metabolic pathway analysis via GATOM (PMID: 35639928). We hope these additions will be useful for readers and point their attention to other metabolic pathways that are significantly altered by nutrient stress, such as the TCA cycle and oxidative phosphorylation, beyond those related to lipid metabolism that we investigated here.

      From this analysis, we did not specifically note strong changes in the expression of polyamine metabolic enzymes or DGATs.

      Finally, the KPC model used in this study involves conditional deletion of p53, which is known to produce tumors with a faster progression and a distinct tumor microenvironment compared to the more commonly used p53^R172H knock-in model. Including this point in the discussion would help contextualize the findings.

      We thank the reviewers for mentioning this limitation of our study. In the results section of the text, we have now included a discussion of the limitations of the mouse model used in this work. We also now highlight in the text that, in addition to the murine p53 deletion model, our studies make use of human PDAC lines that contain p53 mutations. We believe that these results provide some evidence that the findings we have made regarding arginine deprivation and SREBP in our genetically defined murine PDAC cell line are applicable to human PDAC cells with more diverse oncogenic lesions.

      Minor comments to improve clarity:

      (1) In Figure 3C, it would be helpful to annotate the PE-linked TG for clarity.

      We do not understand exactly what PE-linked TGs refers to. We note in Fig. 3C that ether-linked triglycerides are labeled in orange and annotated as O-TG and vinyl ether-linked triglycerides are labeled in grey and annotated as P-TG.

      (2) Is Figure 3P mislabeled? Both conditions are labeled as +Arg / -lipid.

      We thank the reviewers for pointing out this mistake in the figure and have updated it to correctly label these samples as sgSREBP1 and sgNTG transduced PDAC cell lines.

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1B: Misspelling in Y axis "Normalized enrichment score".

      We thank the reviewer for catching this mistake and have corrected this error.

      (2) Figure 1B: Could the authors elaborate on why they decided to focus specifically on these three hits, which are not the most downregulated genes (the "top hits") appearing in the GSEA?

      We chose to focus on lipid metabolism as multiple transcriptomic analysis tools, namely GSEA and GATOM, which specifically focuses on enrichment in KEGG annotated metabolic pathways, highlighted lipid synthesis as being the most transcriptionally regulated metabolic pathway in TIFM. To make this apparent to readers, we added an additional Figure supplement to Figure 1 that includes a comprehensive list of up- and downregulated gene sets in PDAC cells cultured in TIFM from GSEA and GATOM analysis. We hope these additions will make the logic for our focus on lipid synthesis clear and will be useful for readers in highlighting other metabolic pathways that are significantly altered by nutrient stress, such as the TCA cycle and oxidative phosphorylation.

      (3) Figure 1: It might improve the clarity of the text if the three pairs of murine cell lines (mPDAC1, mPDAC2, mPDAC3) were introduced in a bit more detail in the main text and not just in the figure legend.

      We have added more detail describing the three mouse cell lines used in the main text.

      (4) Figure 1E: The authors may wish to comment on why they chose to perform transcriptomic analyses with the mPDAC3 derived models, and not mPDAC1 or mPDAC2, given that mPDAC3 appears to exhibit the most distinct phenotype of the three, according to the results presented in Figure 1 J-L.

      The transcriptional analysis described in Fig. 1E was performed on a previously acquired dataset using mPDAC3 cell lines (PMID: 37254839), which is why this line was used. We have revised the text to make it clear that this transcriptional analysis uses pre-existing data from a previous publication.

      (5) Figure 1L: The authors may wish to clarify why they only show relative palmitate to assess global fatty acid biosynthesis in these cell lines. There is a decrease in labeled palmitate of mPDAC3 cells cultured in TIFM in comparison to the cells cultured in RPMI media, showing a decrease in the lipid biosynthesis of these cells in these conditions. However, there also seems to be lower palmitate levels in the TIFM-cultured mPDAC3 cells specifically, in comparison to their mPDAC1 and mPDAC2 counterparts. Why is that? Could the authors comment on this result?

      We thank the reviewers for this helpful observation. In Figure 1L (now Figure 1N), we wanted to show how culture conditions (RPMI/TIFM) affected both the total amount of palmitate in PDAC cells but also the fraction that is labeled (i.e. arising from de novo synthesis). We think this provides more information for readers by allowing them to assess both changes in pool size of palmitate and changes in the fraction of palmitate that is synthesized. We like this presentation as it shows clearly that while total palmitate levels behave differently across cell lines (with TIFM culture reducing levels in mPDAC1-2 but increasing levels in mPDAC3) the amount of palmitate that is synthesized de novo is decreased in all three cell lines when cultured in TIFM. To highlight this, we also present the fraction of palmitate that is labeled in Fig. 1O.

      We are unsure why TIFM culture reduces total palmitate levels in some PDAC cell lines, while others are able to maintain total palmitate pools. We assume that TIFM cultures increase lipid uptake to compensate for lack of synthesis, and potentially differences in lipid scavenging capacity between the lines could explain this difference. We are currently working on experiments to test these hypotheses and will present the results in a future study.

      (6) Figure 2 - Figure Supplement 1A: It would be informative and appreciated to know which nutrients are actually represented and correspond to certain points on the graph, in particular for the ones that are the most differentially present in the two different media.

      We have now updated this graph to highlight key metabolites that are most differentially abundant between the two media. We also now provide as a Supplementary file the composition of TIFM, which provides readers with all the information needed to understand which metabolites are differentially abundant in TIFM and any media they wish to compare.

      (7) Figure 2 - Related to Figure supplement 1D: It would be useful to know how or why arginine was selected for further investigation from the subset of amino acids. The authors could elaborate on this, by showing or highlighting the data that drew attention to this amino acid initially.

      We thank the reviewers for this note. We have tried to make Figure 2 – Figure supplement 1 clearer as to how arginine was selected for further investigation. We have updated the figure to improve clarity for the comparisons of different media that enabled us to identify differences in amino acids between RPMI and TIFM as driving the difference in lipid metabolism. We have also highlighted in Figure 2 – Figure supplement 1A that arginine is the most differentially abundant amino acid and edited the text to explain that this high degree of differential abundance is why we focused on arginine amongst all the amino acids as a likely candidate for regulation of SREBP1.

      (8) The legends for Figures 2G and 2H could be improved, i.e., making clearer that 2H shows incorporation in the circulating fatty acids, unlike 2G.

      We have updated the figure with improved labeling as the reviewer suggested to denote which panels correspond to which sample type.

      (9) Figure 3E and 3G: The heatmaps displayed here show that the addition of arginine to TIFM culture medium restores fatty acid synthesis; however, it appears that the nature of the lipids synthesized in this condition may differ from the ones synthesized in RPMI cultured conditions.

      We have added additional text highlighting that arginine supplementation to TIFM and RPMI culture led to induction of different SREBP1-target genes, but that both lead to activation of fatty acid synthesis and desaturation genes, which contributes to our focus on de novo synthesis of saturated and monounsaturated fatty acids in this study.

      (10) Figure 3O: The SREBP1 immunoblot still seems to show some residual bands for the cells transduced with SREBP1 targeting sgRNAs, therefore, the authors may want to be more nuanced and present this model as a KD, instead of a KO, as mentioned in the text?

      We agree with the reviewer’s suggestion, and we have changed the text to describe these as knockdowns rather than full knockouts.

      (11) Figure 3P: Is it possible that there is an error in the legend of the figure (Lipids + for the first bar and - for the second one?). The figure could also be improved by a legend that explains what the different colored bars represent.

      We thank the reviewers for pointing out this mistake in the figure and have updated it to correctly label these samples as sgSREBP1 and sgNTG transduced PDAC cell lines.

      (12) Figure 4: The authors are stating in Figure 4 - Figure supplement 1A-F, that arginine-restricted mPDAC cells are not sensitized to xCT or GPX4 inhibitors that trigger ferroptosis and that therefore SREBP1 suppression by arginine restriction in the TME does not sensitize PDAC cells to ferroptosis inducers. However, this does not appear to be so clear with the data shown. This might be due to the limitations associated with the population doubling measurements instead of the lethality measures noted above. Likewise, later it is proposed that arginine restriction sensitizes both mPDAC cells and human PDAC cells to α-ESA induced ferroptosis. These results would benefit from a direct measure of cell death. Related to the above point, it would be useful to better understand why cells cultured in arginine-deprived TIFM do not appear to be sensitized to ferroptosis inducers, but these same cells die from ferroptosis when treated with α-ESA. It would be useful to present some thoughts.

      We thank the reviewers for bringing up this important point. To the reviewer's first point, we repeated xCT and GPX4 inhibitor treatment experiments to include both growth rate-corrected (PMID: 27135972) proliferation assays and Sytox-based viability assays. In both cases, we did not find consistent sensitization to xCT or GPX4 inhibitors across multiple PDAC lines when cultured in TIFM. In contrast, we found consistent sensitization to PUFA treatment across multiple murine and human PDAC cell lines cultured in TIFM. Together, this analysis suggests that arginine starvation specifically sensitizes PDAC cells to PUFAs, but not other ferroptosis inducers.

      We agree with the reviewer that this is an interesting and unexpected observation. We do not have a mechanistic understanding as to why this is the case. However, we believe this is quite interesting and suggests that PUFAs may be a better method of inducing ferroptosis in certain conditions than other ferroptosis-inducing approaches. We have added text to the discussion to highlight this interesting and unexplained observation.

      (13) Figure 6: The authors mention that α-ESA is used here at sublethal doses, which do not affect viability or proliferation, but this is not shown in either the main or supplementary data. These data should be provided somewhere. It might also be nice to mention in the main text (not just in the legend) the dose of α-ESA used for the combination treatments.

      We thank the reviewers for this helpful suggestion. To illustrate that α-ESA is used at a sublethal dose, we altered each panel to be on a linear rather than logarithmic x-axis, therefore including the DMSO control arm for each ferroptosis inducer in combination with α-ESA. We hope this now clearly illustrates that this dose of α-ESA is not perturbing cell growth or viability in these assays.

      (14) Figure 6B: Fer-1 treatment does not seem to rescue the phenotype very clearly. This could again be because cell death is being conflated (to a degree) with effects on proliferation, and Fer-1 is not expected to affect cell proliferation. Again, measuring cell death directly would be better than measuring population doublings.

      We thank the reviewers for this helpful comment. To address this concern, we have added Sytox-based viability assays to Figure 6. These assays indicate that Fer-1 treatment rescues the viability of PDAC cells treated with ferroptosis inducers, α-ESA, or the two in combination.

      Reviewer #3 (Recommendations for the authors):

      General notes:

      (1) It would be easier for the reader if one condition were consistently placed in the same position throughout the graphs. For example, RPMI results should always appear first and TIFM second. Currently, this is inconsistent throughout the manuscript (e.g., Figure 1 - Figure Supplement 1: RPMI is first and TIFM second; Figure 2 - Figure Supplement 1: TIFM is first and RPMI second).

      We thank the reviewers for this note. We have updated the figures to remain consistent in their ordering throughout the manuscript.

      (2) Please briefly explain the differences between PDAC1-3 and clarify why most follow-up experiments were conducted using PDAC1. Presumably, this was because PDAC1 showed the most robust effect on fatty acid synthesis.

      We have added additional text in the results section of the manuscript describing the different murine PDAC lines used in this study. We performed most studies with mPDAC1 as this line has robust differences in fatty acid synthesis between culture conditions. However, murine PDAC lines recapitulate the transcriptional subtype diversity of PDAC (PMID: 29364867), so we critically repeat key experiments in multiple mPDAC lines to determine if a given finding is translatable to other PDAC subtypes.

      (3) Are only SREBP1 protein levels affected or are SREBP1 RNA levels also decreased in low arginine TME?

      We appreciate this important comment. We have added SREBP1 RNA levels to Figure 1 to show that RNA levels do not differ between conditions, whereas protein levels of SREBP1 change significantly.

      (4) What was the rationale for investigating lipid metabolism even though it was not the top changed metabolic gene signature? It would be interesting to briefly discuss which pathways were the most enriched.

      We chose to focus on lipid metabolism as multiple transcriptomic analysis tools, namely GSEA and GATOM, which specifically focuses on enrichment in KEGG annotated metabolic pathways, highlighted lipid synthesis as being the most transcriptionally regulated metabolic pathway in TIFM. To make this apparent to readers, we added an additional Figure supplement to Figure 1 that includes a comprehensive list of up- and downregulated gene sets in PDAC cells cultured in TIFM from GSEA and GATOM analysis. We hope these additions will make the logic for our focus on lipid synthesis clear and will be useful for readers in highlighting other metabolic pathways that are significantly altered by nutrient stress, such as the TCA cycle and oxidative phosphorylation.

      Further comments:

      (1) Figure 1 Supplement 1A: It is not clear which SREBP target genes are significant. Please indicate this more clearly.

      The analysis in this section was done on the expression level of all the indicated genes between groups (tumor/normal) rather than testing for significance of individual genes between the two groups. We have updated both the text and the figure legend to clarify that this was the statistical analysis performed.

      (2) Figure 1J and 2C: The Western blot loading control (Actin) does not appear equal across all samples. It would be helpful to include a quantification normalized to the Actin loading control.

      We have included quantification of each western blot to help interpret these immunoblots.

      (3) Supplementary Figure 2: How often has this experiment been performed? The TIFM results appear to consistently show the same values. If this is the case, it needs to be labeled appropriately.

      Thank you for pointing out that how we presented the data was confusing as to how the experiment described was performed. Initially, we performed multiple separate experiments to identify arginine starvation as the TIFM-driver of SREBP1 suppression. To compare across all the separate media conditions, we performed one experiment with all the relevant media conditions together, which is the experiment that is described in the manuscript. Thus, there was one set of control TIFM/RPMI conditions to which we compared all of the different media conditions. As we initially presented the data, it appeared as if we had performed multiple experiments in which the TIFM/RPMI controls had exactly the same behavior, which is not the case. We have updated the data presentation in this figure to make it clear that this was the experimental design for the data presented.

      (4) Figure 3P: Please add a legend for this panel.

      We thank the reviewers for pointing out this mistake in the figure and have updated it to correctly label these samples as sgSREBP1 and sgNTG transduced PDAC cell lines.

      (5) Figure 4 - Figure Supplement 1: Please review the legend carefully. The legend currently includes only circles, but some of the graphs (A and F) display squares.

      Thank you for catching this mistake. We have updated the panels and legends for this figure so they are concordant.

      (6) Figure 4D: The effect of a-ESA treatment on the doubling delta of arginine-treated versus non-treated TIFM cells looks similar. It looks like the difference is because cells treated with arginine start at higher doubling values from the beginning. I would suggest looking at the delta and subsequently tone down the statement: "Addition of arginine significantly decreases sensitivity to a-ESA."

      Thank you for this helpful comment. To avoid any confounding effects of differences in basal growth rate between mPDAC cells grown in different media, we have converted all of our data to GR values as described in (PMID: 27135972) which enables us to take into account the basal growth rates of cultures when calculating the effects of treatments/perturbations on culture growth and viability. We hope this addition makes the effect that arginine has on α-ESA sensitivity clear beyond the impact that arginine has on basal growth rate.
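
      The growth rate correction referred to above can be illustrated with a minimal sketch. It assumes the standard GR definition, GR = 2^(log2(x_treated/x0) / log2(x_ctrl/x0)) - 1, where x0 is the cell count at the start of treatment and x_ctrl and x_treated are the end-point counts of untreated and treated cultures. The function name and all cell counts below are hypothetical and are not taken from the manuscript; the authors' actual analysis pipeline is not described here.

```python
import math


def gr_value(x0: float, x_ctrl: float, x_treated: float) -> float:
    """Growth-rate-normalized response of a treated culture.

    x0        -- cell count at the time of treatment (hypothetical input)
    x_ctrl    -- cell count of the untreated control at the end point
    x_treated -- cell count of the treated culture at the end point

    Returns 1 for no effect, 0 for complete cytostasis,
    and a negative value when cells are lost.
    """
    if min(x0, x_ctrl, x_treated) <= 0:
        raise ValueError("cell counts must be positive")
    ctrl_doublings = math.log2(x_ctrl / x0)
    if ctrl_doublings <= 0:
        raise ValueError("control culture must have grown")
    treated_doublings = math.log2(x_treated / x0)
    return 2 ** (treated_doublings / ctrl_doublings) - 1


# Hypothetical counts: both cultures lose 0.5 doublings under treatment,
# but the control grows 1 doubling in one medium and 2 in the other.
slow = gr_value(100, 200, 100 * 2 ** 0.5)   # control: 1 doubling
fast = gr_value(100, 400, 100 * 2 ** 1.5)   # control: 2 doublings
print(f"slow-growing culture GR = {slow:.3f}")
print(f"fast-growing culture GR = {fast:.3f}")
```

      In this toy example, the same absolute loss of doublings yields a lower GR value in the slower-growing culture, which is why normalizing to each condition's own control growth separates a treatment effect from differences in basal proliferation.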

      In addition, we also measured the viability of α-ESA treated mPDAC cells with and without supplemental arginine (current Fig. 5E) by Sytox-exclusion assay. We believe these new data support the claim that arginine makes PDAC cells resistant to the addition of exogenous PUFAs.

    1. eLife Assessment

      This important study provides a quantitative comparison of how zebrafish and medaka larvae process visual motion, revealing clear differences in how they integrate information across space and time. The evidence is convincing, combining a broad set of behavioral assays with response decomposition and mechanistic modeling that together support the central conclusions. Some aspects remain incomplete, particularly the link between the spatial and temporal findings, the extent to which the model accounts for the full range of behavioral results, and the framing of broader evolutionary or social interpretations. Overall, the work offers a careful and informative analysis that should be of broad interest to researchers studying visual processing, sensorimotor computation, and comparative neuroscience.

    2. Reviewer #1 (Public review):

      Summary:

      This study investigates how two closely related fish species differ in their processing of visual motion, with a focus on spatial and temporal integration underlying behavior. Using a series of behavioral assays combined with computational modeling, the authors identify clear species-specific differences in how visual information is integrated to guide movement.

      Strengths:

      A major strength of the work is the systematic and quantitative behavioral analysis, which reveals robust differences between species, including broader spatial integration and longer temporal persistence in medaka compared to zebrafish. The decomposition of behavior into distinct components provides a useful framework for interpreting these differences.

      Weaknesses:

      The computational modeling captures several key aspects of the observed temporal dynamics, particularly differences in response persistence. However, the modeling framework is primarily focused on temporal processing and does not incorporate spatial integration, which is a central finding of the study. In addition, some experimental observations, such as responses to short-duration stimuli and certain frequency-dependent features, are only partially reproduced. These limitations indicate that the link between the model and the full range of behavioral results remains incomplete.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript presents a comparative analysis of optomotor behavior in zebrafish and medaka larvae. Using multiple behavioral paradigms, the authors argue that the two species differ in both the spatial and temporal integration of visual motion. They further decompose turning behavior into large- and small-turn components and use a simple mechanistic model to capture several of the main response features. Overall, the study addresses an interesting question, and the comparative framework gives the work a clear conceptual appeal.

      Strengths:

      A major strength of the manuscript is the breadth of the behavioral analysis. The authors use several stimulus paradigms to probe spatial extent, temporal persistence, and response dynamics, which makes the cross-species comparison richer and more informative than a single-assay study. The decomposition into large and small turn components is also a useful feature of the work, as it provides a more structured account of where the species differences may arise. The modeling further helps organize the results and offers a useful framework for interpreting the behavioral differences.

      Weaknesses:

      The main limitations are in presentation and clarity rather than in the overall motivation or approach. In several places, it is difficult to determine exactly how some quantities are summarized statistically, and some figures and legends would benefit from clearer explanations. In addition, a few of the more specific interpretive claims would be strengthened by more explicit statistical framing and slightly clearer presentation. These issues appear addressable and do not detract from the overall interest of the study.

    4. Author response:

      We appreciate the constructive feedback from the reviewers and are currently working diligently to address all concerns raised in both the public reviews and the recommendations for the authors. Below, we outline the revisions planned for the revised manuscript.

      (1) We acknowledge the limitations of the current modeling framework regarding spatial integration, and we agree that the present model does not account for the short lifetime of the dot stimuli.

      For spatial integration, our current data suggest a relatively narrow, center-weighted integration function in zebrafish, compared to a broader integration function in medaka. While incorporating such spatial weighting into the model would improve its completeness, we do not expect it to substantially alter our current interpretation of the underlying mechanisms.

      Regarding the responses to short-lifetime dot stimuli, we hypothesize that medaka may possess local retinal receptive units that function as low-pass filters, as illustrated schematically in Figure 3e. At present, however, we believe that explicitly modeling this component would remain largely uninformative and would not substantially increase the explanatory power of the model.

      In the revised manuscript, we will discuss these limitations and the possible neural implementations more explicitly in the Discussion section.

      (2) We appreciate the reviewer’s comments regarding the clarity of data presentation and statistical descriptions.

      In the revised manuscript, we will improve the clarity of the figures and legends and provide more explicit explanations of the statistical analyses and summary metrics used throughout the study. We will also revise several sections of the text to improve the framing and interpretation of the results.

    1. eLife Assessment

      This study presents a large-scale characterization of single-neuron responses during reading and listening, enabling examination of both 'low-level' (orthographic/phonological) and 'higher-level' (syntactic) features, as well as links between single-neuron activity and multi-scale field potentials, making it a valuable resource for bridging micro- and macroscale accounts of language processing. The analyses identify modality-specific and putatively modality-independent responses across distributed brain regions, offering an intriguing framework for understanding how sensory-specific and abstract representations may relate. However, the evidence supporting the central claims is currently incomplete, due to limited population-level quantification, insufficient statistical characterization of how many neurons encode the relevant features, ambiguity in the interpretation of encoding model results, and a lack of rigorous tests of cross-modal generalization and alternative accounts, which together weaken the conclusions about amodal representations and hierarchical processing.

    2. Reviewer #1 (Public review):

      Summary:

      This paper presents rare and unique recordings of single neurons, LFPs, and SEEG data from human patients performing reading and listening tasks. The authors identify single neurons in temporal and ventral occipito-temporal cortex that respond specifically to spoken and written language, and primarily encode either phonological or orthographic features of the stimuli. They also identify neurons in the middle temporal and inferior frontal cortex that respond to both modalities, which they interpret as amodal language responses. In general, neuronal population firing rates are correlated with both micro- and macro-scale broadband gamma responses, though they observe some dissociations, particularly with the macro-scale. The results are interpreted to support a model of modality-specific to amodal processing throughout many distributed brain areas for language.

      Strengths:

      (1) The data are truly unique, providing a large-scale characterization of single neuron responses from the human brain during written and spoken language processing.

      (2) The task and stimulus conditions allow for examination of both low-level (e.g., orthographic/phonological) and higher-level (e.g., syntactic) encoding.

      (3) Showing relationships between single neuron and multi-scale LFP recordings from the same sites helps bridge neuronal and meso/macroscale literatures.

      Weaknesses:

      (1) My main comment about the paper is that it feels like a collection of somewhat random descriptions of a very small number of hand-picked single neurons. I think that the task and stimulus design shown in Figure 1A sets up some clear hypotheses that could be tested rigorously across the full neuronal population, but instead, the authors pick a few neurons and fit encoding models that don't take advantage of the contrasts. I agree that encoding models are a powerful approach, but with only 508 total words and what appears to be a limited set of variability across the various features, it's not clear to me that the stimuli, which were apparently designed as minimal pairs, provide enough power to find robust results. Perhaps this is why the majority of the results only show a very small number of units (most of which are actually buried in the supplement), but it's odd to me that they don't show the results of the minimal contrasts other than for length.

      (2) Related to point (1), other than Figure 2H and Figure 6A-B, the results are only shown for a tiny number of units. This is great for demonstrating qualitatively what the effects look like, but there is no quantification of the findings across the population, which undermines the point in the abstract that 1000 neurons were recorded. This is acknowledged in some places, but as a reader, it leaves me wondering how seriously to take the interpretations if they seemingly cannot be replicated. I understand this is a challenge with human single neuron recordings, but as presented, the paper as a whole comes across as largely anecdotal.

      (3) Some of the key claims rest on the idea that neurons were recorded from the superior temporal gyrus and fusiform gyrus. For the STG claim, I don't understand how this was done, or what specifically they mean by STG, since the microwire locations do not appear to be anywhere near the lateral surface. This makes sense given the profile of the Behnke-Fried electrodes, but if they want to claim that there are neurons from the STG, they need to be more specific and show where precisely these wires are. If they are more medial as it appears, they need to explain how they dissociated STG from Heschl's gyrus. Similarly, for the fusiform neurons, I can only see a couple of probes that appear to have their tips near where I would think this area is. Perhaps this is more of a visualization issue with Figure 1F, but overall, I am not convinced that the neurons are exactly where they say they are.

      (4) Related to point (3), some of the authors have made strong claims in prior work about the precise coordinates of the VWFA, so it would help to know how many units are within this exact region. The ROIs marked in Figure 2 are quite large, and given results like Vinckier et al. 2007, it's important to know where along the hierarchy the recordings were actually performed. Similarly, given the framing in the intro around the VWFA as a key area, the idea that some of the best example neurons are from the right fusiform is a bit confusing. I don't think they can make the claims about visual hemifields since it does not appear that they recorded eye tracking to verify constant central fixation, and it may be a bit surprising to see such strong orthographic selectivity in the right hemisphere (though, as a result, it may suggest a more nuanced view of lateralization of reading at the single-neuron level).

      (5) In many sections of the paper, there are vague and unquantified claims like "many neurons" or "a large number of units". This needs to be made explicit. It would also help to show where statistical threshold cutoffs are on plots like Figure 2H, since the "brain-score" is used to select units for many analyses.

      (6) More detail on the TRF models is needed in the methods. At the very least, a complete list of the features in each group is necessary to evaluate claims about very broad sets of features like "syntax". It would also help to know how the features were coded, especially where there is a mixture of continuous and discrete features within the model.

      (7) Depending on how exactly the features were defined, I'm skeptical of some of the claims, like position-specific "w". There are some obvious confounds that need to be controlled here, like whether word-initial "w" is strongly associated with shorter, higher frequency words (like "wh-" words). There are other examples, like whether specific forked letters tend to appear in certain syllables in English words. While it may be the case that these kinds of patterns are uniformly distributed, it needs to be established in this particular stimulus set.

      (8) The claim that there is monotonic encoding of word length does not seem strongly supported in the data. In both PC1 and the single neuron examples, it seems like there may be a non-linear relationship, which could suggest that another correlated feature (e.g., word frequency) is involved.

      Minor Points:

      (1) What are "boundaries"? They are not described anywhere I could find, but they are a feature group that was used in the TRFs.

      (2) The caption for Figure 6C says MTG and insula, but the text says MTG and IFG. Similar to the above comment about STG and fusiform, it's not clear to me how they achieved single-unit recordings with Behnke-Fried probes in these areas.

      (3) The somewhat less robust correlations between firing rate and BGA in macro vs micro contacts are potentially interesting. However, did they verify that the closest macro contact was always in the gray matter of the same gyrus as the microwire?

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript, "Modality-Specific and Amodal Language Processing by Single Neurons," presents an intracranial electrophysiology study investigating how language is represented in the human brain across spoken and written modalities. The authors analyze activity from over one thousand single neurons and local field potentials recorded in twenty-one neurosurgical patients while participants read and listened to sentences. Using encoding models based on temporal receptive fields, they examine whether neural responses track modality-specific features, such as phonological and orthographic information, as well as higher-level linguistic features. The results are interpreted as evidence for a dissociation between modality-specific processing in sensory regions and modality-independent ("amodal") representations in temporal and frontal cortices, supporting a two-stage model of language processing.

      Strengths:

      This study uses a rare and valuable dataset, combining single-neuron recordings with broader field potential measures in human participants. The large-scale recording, in terms of both neuron count and anatomical coverage across multiple regions and individuals, represents a significant technical achievement for intracranial research.

      The use of encoding models to relate neural activity to multiple levels of linguistic representation is methodologically rigorous and provides a unified framework to compare phonological, orthographic, and higher-level features. This approach allows the authors to systematically test how different aspects of language are represented across neurons and regions.

      Another key strength is the attempt to directly link concepts from Linguistics to neural data. By framing the results in terms of modality-specific versus amodal representations, the study engages with longstanding theoretical questions and offers a potential bridge between linguistic theory and systems neuroscience.

      The manuscript is also very well written, and the data are presented clearly and effectively. The inclusion of raw data and raster plots is particularly valuable, as it allows readers to directly assess the neural responses and strengthens the transparency of the analyses.

      Weaknesses:

      Despite these strengths, the central claims of the paper are not fully supported by the analyses presented, and several key issues limit the strength of the conclusions.

      A primary concern is the lack of clear reporting and statistical characterization of the proportion of neurons that significantly encode the tested linguistic features. While the paper presents illustrative examples and regional patterns of encoding, it does not systematically quantify how many neurons exhibit significant effects across conditions, nor does it provide formal statistical comparisons of these proportions across brain regions or feature types. As a result, it is difficult to determine whether the reported dissociations reflect robust population-level phenomena or relatively sparse subsets of neurons identified through model fitting. Figure 2H offers a visual depiction of the distribution of Brain-Score (a measure of model evaluation) across the fusiform gyrus and superior temporal gyrus, but it falls short of providing formal statistical testing or quantitative summaries, limiting its interpretability in supporting the authors' claims. Given that the authors employ temporal receptive field (TRF) analyses, the framework naturally allows for straightforward quantification of the proportion of neurons that significantly encode any linguistic features in the model, which could be reported by region as well as by stimulus condition (auditory vs. visual). Including such analyses would further strengthen the population-level interpretation of the results.

      Relatedly, the interpretation of "amodal" neurons is not sufficiently substantiated. The classification of neurons as modality-independent relies on encoding model performance across conditions, but the statistical criteria for establishing cross-modal generalization are not always clearly defined or rigorously tested. Without explicit comparisons (e.g., testing whether the same neurons significantly encode features in both modalities above chance, and whether this exceeds what would be expected under appropriate null models), the claim of modality-independent representation remains somewhat underdetermined.

      More generally, the reliance on encoding models introduces some interpretational ambiguity. Although the observed dissociation between fusiform and superior temporal regions is consistent with orthographic and phonological processing, respectively, the feature spaces used in the models are partially linked to lower-level sensory properties (e.g., visual form and acoustic features). The authors' single-neuron results suggest these effects reflect genuine linguistic selectivity, but the findings do not uniquely distinguish between linguistic and perceptual explanations. While fully disentangling these factors may be beyond the scope of the current study, the manuscript could benefit from a brief discussion acknowledging these correlations or clarifying how lower-level sensory contributions were considered.

      Another limitation is that the proposed two-stage model of language processing is not directly tested against competing hypotheses. While the dissociation between modality-specific and amodal representations is consistent with this model, the authors note that higher-level features, such as syntax, may be encoded in a distributed or overlapping manner. These possibilities are not systematically tested, so the conclusions risk overinterpreting correlational patterns as evidence for a specific processing hierarchy. A more explicit discussion or quantitative consideration of these alternative accounts would strengthen the interpretation, while still allowing the two-stage model to be presented as a plausible framework.

    4. Reviewer #3 (Public review):

      Summary

      This paper analyzes human single-neuron activity recorded with Behnke-Fried electrodes during naturalistic listening and reading. The authors demonstrate a double dissociation between superior temporal gyrus neurons (responsive during listening but not reading) and fusiform gyrus neurons (responsive during reading but not listening), and report that these two classes of neurons show selectivity to specific phonological and orthographic features of the stimulus, respectively. Across the language network, the authors also report neurons whose responses are amodal (active during both listening and reading), which they organize into a modal-to-amodal processing hierarchy. A separate thread of analyses tracks the relationship between single-neuron spiking, micro-wire, and macro-wire signals across these regions. The authors interpret their findings as evidence for hierarchical processing across the language network and for a "compositional code" for orthography in reading.

      Strengths

      The dataset is rare and valuable. Simultaneous single-neuron, micro-wire, and macro-wire recordings during naturalistic reading and listening in the same patients are difficult to obtain, and the experimental design reflects substantial care. The cross-modality comparison at single-neuron resolution is a novel measurement, and the paper presents these results while also situating them against prior neuroimaging and intracranial work. The simultaneous availability of signals at three spatial scales within the human language network is an unusual and potentially important resource for the field.

      Weaknesses

      (1) Framing and novelty

      The paper appropriately situates its modality-selectivity findings against prior neuroimaging and intracranial work (citing Buchweitz et al. 2009 among others) and frames its novel contribution as bringing single-neuron resolution to a question that has previously been examined at population scales. This framing is fair as far as it goes. However, two issues remain. First, the paper does not engage with neuroimaging evidence that complicates its clean modality-selectivity story - most notably Wilson, Bautista, & McCarron (2018), who found that the dorsal superior temporal sulcus is activated by both intelligible and unintelligible inputs in both modalities. Several reconciliations of single-neuron modality selectivity with population-level cross-modal activation are possible (sparse coding, BOLD-vs-spiking dissociations, etc.), and the paper should engage with these possibilities. Second, the paper's discussion extends well beyond the modality-selectivity result that is its headline contribution, into broader claims about a "compositional code" for orthography and "hierarchical processing" across the language network. These broader claims are not supported by the analyses presented (see Weakness 3), and their inclusion distracts from and weakens the core finding rather than building on it. The paper would be stronger if these claims were either subjected to the population-level analyses they require or scaled back to exploratory observations.

      These framing issues are compounded by writing problems that obscure what the paper is claiming. Some passages, such as the assertion that the dataset "suggests an unprecedented examination of linguistic features across various brain regions at various resolutions," are not interpretable as written and should be rewritten.

      (2) Methodological concerns about the TRF analyses

      The selectivity findings in Figures 3 and 5 rest on temporal response function / temporal receptive field (TRF) analyses with several core issues.

      2.1) First, the construction of the TRF feature stream for the reading condition is not specified in the methods. Reading stimuli are presented in RSVP, with all letters of a word appearing simultaneously. How letter or letter-position features are mapped to a time-varying regressor reflects a substantive hypothesis about the psychological mechanisms of reading, with statistical consequences for what the TRF can recover and how reading and listening analyses can be compared.

      2.2) Second, the stimulus distribution limits which effects can be reliably estimated. While the design appears balanced for some features (e.g., subject gender and number), the features that drive the TRF analyses - particularly letter identity and position in the orthographic TRF - are unlikely to be well covered in a small stimulus set. This raises a concern about high-variance feature importance estimates.

      2.3) Third, the TRF feature set includes syntactic, semantic, and discourse predictors alongside phonological and orthographic features. The paper does not justify this choice in fitting single-neuron responses in STG and FSG, and the consequences for the unique-variance analyses are not discussed. Because syntactic features are correlated with phonological and orthographic features in natural stimuli (function words are short, have characteristic phoneme distributions, and so on), the unique variance attributed to each feature set depends on what is being controlled for. Including syntactic predictors when fitting STG or FSG neurons also risks inflating overall TRF fit by chance, particularly in the absence of cross-neuron correction.

      2.4) Fourth, there seems to be no correction for multiple comparisons across the neuron × feature grid. The within-neuron feature-importance procedure briefly described in the Figure 3 caption may help combat overestimates of feature importance within a single fit, but does not address the question of how many of the "selective" neurons reported across the paper would survive correction at the population level. With many neurons, many features, and a limited stimulus set, some neurons will appear selective to some features by chance alone, and these are likely to be the ones that appear as example panels in figures.

      Together, these issues mean the per-feature selectivity results cannot be interpreted as the paper currently interprets them. This is consequential because the per-feature selectivity findings underpin the paper's broader claims about a compositional code for orthography and about hierarchical processing across feature levels.

      (3) Claims that outrun the evidence

      Several of the paper's broader claims are not supported by the analyses presented.

      3.1) The authors claim a "compositional code" for orthography, in which single neurons code for the combination of letter identity and position. This claim is illustrated with two example neurons. A claim about a coding scheme is a population-level claim and requires a population-level analysis. A natural test would be a per-neuron model comparison between a TRF with letter identity alone and a TRF including letter identity × position interactions, controlled for model complexity, asking how many neurons show improved prediction with the interaction features. As noted above in §2.2, this analysis would also need to grapple with which letters and positions the data can support estimating. There is a potential connection to the data sparsity worries here: the n=2 example neurons may have the only selectivity profiles for which the relevant interactions could be estimated at all.

      3.2) The "hierarchical processing" claim is motivated by neurons selective to features at multiple levels - graphemes and sub-graphemes in reading, single phonemes and diphthongs in listening. This claim is not specified mechanistically. The paper does not state what kind of structural linguistic hierarchy is intended (segmental phonology to syllabic structure?), what kind of hierarchical neurocomputational mechanism is being proposed, or why selectivity at multiple levels of a feature hierarchy is evidence for that mechanism rather than for any other mechanism (e.g., parallel feature detectors). As written, the claim is too underspecified to evaluate.

      3.3) The "forked letters" finding (selectivity to k, v, w, y, z) is potentially confounded with letter frequency and co-occurrence structure. These letters are low-frequency, with some exhibiting strong positional asymmetries, and they infrequently co-occur with other letters. Under the unique-variance analysis, decorrelation from other features inflates apparent unique variance even in the absence of genuine selectivity.

      3.4) The word-length effect in Figure 4 is established by PCA on the top five fusiform neurons, with no analysis showing the effect is qualitatively similar across a broader selection. Beyond establishing that something varies with word length, the paper makes no substantive claim about what the neural code represents - for instance, whether it reflects letter- or word-specific processing or a more general visual response to stimulus extent. Prior intracranial work has reported word-length effects in regions posterior to the VWFA but not within it (Thesen et al. 2012), raising the question of whether the effect reported here reflects letter-specific processing or a more general visual response that happens to correlate with stimulus extent.

      (4) Missed opportunities

      Several aspects of the paper are not so much wrong as underdeveloped, in ways that the authors are well-positioned to address.

      4.1) The cross-scale comparison between single-neuron, micro-wire, and macro-wire signals is presented descriptively, without articulating what conclusion these analyses support about the relationship between scales of measurement. Given the rarity of simultaneous recordings at these scales, this is a substantial missed opportunity. The rasters in Figure 2 visually suggest a tight relationship between spiking and micro-population activity that is not evident in the summary in Figure 2g. This discrepancy is not explained. Characterizing the functional and temporal relationship linking spike rates to micro- and macro-HGA is a substantive scientific question, and the paper is well-positioned to address it.

      4.2) The stimuli include controlled grammatical manipulations, but these manipulations are used as nuisance regressors in the TRF analyses rather than as the object of structured analysis. A design with controlled comparisons is being treated as if it were unconstrained naturalistic stimulation, which underuses the experimental structure the authors built.

      4.3) Finally, the paper foregrounds the dataset as a contribution but does not describe data sharing plans. Given that several of this review's recommendations call for analyses the authors have not yet done, the long-term value of the dataset to the community will depend substantially on what is shared and how.

      Buchweitz, A., Mason, R. A., Tomitch, L. M., & Just, M. A. (2009). Brain activation for reading and listening comprehension: An fMRI study of modality effects and individual differences in language comprehension. Psychology & neuroscience, 2(2), 111-123.

      Jobard, G., Vigneau, M., Mazoyer, B., & Tzourio-Mazoyer, N. (2007). Impact of modality and linguistic complexity during reading and listening tasks. Neuroimage, 34(2), 784-800.

      Thesen, T., McDonald, C. R., Carlson, C., Doyle, W., Cash, S., Sherfey, J., Felsovalyi, O., Girard, H., Barr, W., Devinsky, O., Kuzniecky, R., & Halgren, E. (2012). Sequential then interactive processing of letters and words in the left fusiform gyrus. Nature communications, 3, 1284.

      Wilson, S. M., Bautista, A., & McCarron, A. (2018). Convergence of spoken and written language processing in the superior temporal sulcus. Neuroimage, 171, 62-74.

    5. Author response:

      We thank the editors and reviewers for their constructive feedback on our manuscript. We accept the reviewers' recommendations and will implement them fully in our revised manuscript and include all of the suggested literature references. Below, we highlight several key points raised during the evaluation and outline exactly how we will address them. We will also explicitly address every other point and minor recommendation raised by the reviewers in our final, comprehensive point-by-point response.

      Population-level quantification and statistical thresholds: The reviewers noted that our manuscript relied on single-neuron examples without fully demonstrating how widespread these patterns are across the recorded population. To address this, we will add population-level quantification across the recorded units using standard False Discovery Rate (FDR) corrections for multiple comparisons. We will include summary tables in the text and add statistical threshold lines to the distribution figures to report the proportion of significant neurons per region.
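
      As an illustration of the kind of population-level summary described above, the following is a minimal sketch of a Benjamini-Hochberg FDR correction followed by a per-region count of significant neurons. It assumes that one p-value per neuron is already available (for example, from a permutation test of each neuron's encoding-model fit). The function names and region labels are hypothetical and are not taken from the authors' pipeline.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Sort indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


def proportion_significant(p_by_region, alpha=0.05):
    """Correct once across all neurons, then report the significant fraction per region."""
    regions = list(p_by_region)
    flat = [p for r in regions for p in p_by_region[r]]
    flags = benjamini_hochberg(flat, alpha)
    out, start = {}, 0
    for r in regions:
        n = len(p_by_region[r])
        out[r] = sum(flags[start:start + n]) / n if n else 0.0
        start += n
    return out


# Hypothetical p-values for two regions (illustrative numbers only).
example = {"fusiform": [0.001, 0.5], "STG": [0.002, 0.9]}
print(proportion_significant(example))
```

      Correcting once across the pooled population, rather than separately within each region, is a design choice in this sketch; a per-region correction would be an equally defensible alternative if regions are treated as separate families of tests.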

      Identifying amodal neurons: Reviewers raised concerns that our classification of amodal language neurons required a more direct test. We will provide additional measures of modality and, in particular, we will implement a cross-modal generalization analysis where our encoding models are trained on one modality (e.g., listening) and evaluated on the other (e.g., reading). This additional procedure will classify neurons as amodal if their cross-modal predictive performance exceeds a baseline null model.
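
      The cross-modal generalization test described above could be sketched as follows, under simplifying assumptions: a plain ridge regression stands in for the full TRF model (no time lags), both modalities share the same feature columns, and the null distribution is built by shuffling the held-out responses. All function names are hypothetical. For real spike trains, a null that preserves temporal autocorrelation (for example, circular shifts or block permutations) would likely be more appropriate than the simple shuffle used here.

```python
import random


def ridge_fit(X, y, lam=1.0):
    """Solve (X'X + lam*I) w = X'y by Gauss-Jordan elimination. X is a list of rows."""
    p = len(X[0])
    # Augmented normal-equation matrix [X'X + lam*I | X'y].
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(X, w):
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def cross_modal_test(X_train, y_train, X_test, y_test, n_perm=200, seed=0):
    """Train on one modality, score on the other, compare with a shuffled-response null."""
    w = ridge_fit(X_train, y_train)
    pred = predict(X_test, w)
    observed = pearson(pred, y_test)
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = list(y_test)
        rng.shuffle(shuffled)
        null.append(pearson(pred, shuffled))
    # One-sided permutation p-value with the +1 correction.
    p_value = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, p_value
```

      In this sketch a neuron would count as amodal only if the score is positive and the p-value survives correction across neurons. A neuron that encodes the same features with opposite sign in the two modalities would yield a negative score and would not pass the one-sided test.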

      Isolating linguistic features from sensory confounds: A point was raised regarding whether some neurons were tracking low-level sensory properties (like sound amplitude or visual text size) rather than language features. We will address this by running encoding analyses that include additional basic acoustic envelopes and visual baseline properties as control variables. This will allow us to evaluate the unique variance explained by linguistic features after accounting for these low-level sensory baselines.

      Evaluating the "Compositional Code" in the Fusiform Gyrus: Reviewers pointed out that our claim regarding a "compositional code" (neurons tracking a combination of letter identity and position) was supported primarily by individual examples. To provide population-level context, we will perform a model comparison across our fusiform gyrus neurons. We will compare a baseline letter-only model against a model that includes letter-by-position interactions to report how many neurons statistically support this compositional structure.
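
      One way to picture the proposed model comparison is the sketch below. It assumes each trial is reduced to a hypothetical (letter, position, firing rate) triple and replaces the TRF with category-mean models, which are the least-squares solutions for one-hot predictors. Held-out error is used so that the richer letter-by-position model is not favored merely for having more parameters. The names and the trial format are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def fit_means(keys, rates):
    """Least-squares fit of a one-hot model: the mean rate for each category key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, r in zip(keys, rates):
        sums[k] += r
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


def cv_error(trials, key_fn, n_folds=5):
    """Mean squared held-out error of a category-mean model under k-fold CV."""
    total = 0.0
    for fold in range(n_folds):
        train = [t for i, t in enumerate(trials) if i % n_folds != fold]
        test = [t for i, t in enumerate(trials) if i % n_folds == fold]
        grand = sum(t[2] for t in train) / len(train)
        means = fit_means([key_fn(t) for t in train], [t[2] for t in train])
        # Categories unseen in training fall back to the training grand mean.
        total += sum((t[2] - means.get(key_fn(t), grand)) ** 2 for t in test)
    return total / len(trials)


def compare_models(trials, n_folds=5):
    """trials: list of (letter, position, rate). Returns (letter-only error, interaction error)."""
    letter_only = cv_error(trials, lambda t: t[0], n_folds)
    interaction = cv_error(trials, lambda t: (t[0], t[1]), n_folds)
    return letter_only, interaction
```

      A neuron would support a compositional code in this sketch only if the interaction model has reliably lower held-out error than the letter-only model. Letter-position combinations that are rare in the stimulus set fall back to the grand mean, which makes explicit which interactions the data can and cannot estimate.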

      TRF Feature and procedure explanation: Reviewers requested clarification on the construction of our TRF features. We will update the Methods section to explicitly detail how the features were constructed for both modalities. We will also include a feature correlation matrix in the Supplementary Materials. Furthermore, to contrast possible low-level confounds with high-level linguistic features, we will conduct a control analysis tracking, e.g., specific affixes across different structural roles – for example, comparing how neurons respond to the phoneme /-s/ when it functions as a plural number marker versus when it appears as part of a lexical item (e.g., pass) or as a third-person verb agreement marker. We will conduct such analyses in addition to fitting the main TRF models with these additional confounds included, ensuring a clear dissociation between high- and low-level features.

    1. eLife assessment

      This is a valuable survey of movements and locomotor patterns produced by circuits in the medial reticular formation (MRF) of the brainstem. The authors provide solid evidence that activation of GABAergic MRF neurons slowed down walking, activation of glutamatergic neurons induced a specific "shuffle" limb trajectory, and the activation of serotonergic neurons increased locomotor speed without affecting walking signature. This study adds to the growing body of knowledge about the effects of brainstem circuits on specific aspects of locomotor function.

    2. Reviewer #1 (Public Review):

      The medial reticular formation (MRF) in the brainstem has long been implicated in the regulation of locomotion. One common - albeit very simple - model often presents the MRF as a major relay station receiving inputs from MLR circuits, among other brain regions, that together convey locomotor signals through efferent projections targeting the caudal brainstem and the spinal cord. Yet, the MRF is a particularly large brain area whose cellular complexity is far from understood. How molecularly distinct MRF ensembles contribute to the regulation of locomotor behaviors is largely unknown. Here, the authors apply focal activation of either glutamatergic, GABAergic, or serotonergic neurons throughout the MRF using a chemogenetic gain-of-function approach to uncover the putative modulatory properties of these neuronal ensembles during walking. Using kinematic analysis of mouse limbs during self-paced over-ground walkway locomotion, the authors find that activation of GABAergic MRF neurons can selectively slow down walking, whereas activation of glutamatergic neurons can induce a specific "shuffle" limb trajectory, altogether revealing that distinct MRF populations may retain the capability to engage divergent walking signatures, whose behavioral relevance is not yet clear. In contrast, the activation of serotonergic neurons did not affect walking signatures as described for the other two subgroups but led to an increase of locomotor speed. Interestingly, MRF neurons in each of the regional activation "hotspots" appear to target different domains in the lumbar spinal cord, suggesting that distinct circuit mechanisms are at play for the slowmo vs shuffle effects.

      Major points:

      1. While the experiments are carefully done and the results are well analyzed and clearly presented in a series of beautiful figures, several aspects of the methodology remain very confusing. In particular, the initial choice for the injection coordinates is not justified and the authors don't leverage the mapping of spinal projection neurons to drive their chemogenetic screen. Similarly, the authors group very different injection schemes (unilateral or bilateral targeting of MRF neurons), that should be analyzed separately. The choice of Z score cutoff that dictates the in-depth analysis of the chemogenetic phenotypes appears arbitrary and is not grounded in a set of objective criteria.

      2. One issue that arises from the work presented here is that we don't know if these MRF neurons are active during locomotion in normal, unperturbed conditions. Knowing the recruitment profile of these MRF neurons would clarify whether the chemogenetic activation boosts the firing of neurons that are already active during walking, or activates neurons that are otherwise silent. Disentangling these possibilities may have a profound impact on the overall interpretation of the results.

      3. The results should be discussed in the broader context of historic stimulation experiments, notably in cats and other species, as well as more recent circuit mapping approaches in rodents. For instance, the notion that focal stimulation of distinct areas within the MRF can elicit or modify the pattern of locomotion is not really new, nor is the notion that some of these modulations are phase-specific and can influence the duration of single muscle activation during stance or swing phases. This last point has for instance already been assessed through individual muscle recordings paired with MRF stimulation in cats. Perhaps better introducing these key studies and thoroughly discussing what the results presented in this manuscript bring in terms of novelty will help readers ground this work in a more comprehensive and larger body of work.

    3. Reviewer #2 (Public Review):

      This paper is an interesting conceptual work where certain hotspot areas were found to induce unique gait patterns. These patterns differed from a classic change in speed or gait pattern from a walk to a gallop. From this, a hypothesis was formed that these areas could be important for possible alternative walking patterns seen, for example, during pathologies such as Parkinson's disease or perhaps related to stalking behaviors.

      While I liked the work and found it interesting, it remains descriptive in that the actual behaviors observed can't be causally related to a particular behavior such as stalking or shuffling. If the necessity or sufficiency of this region was related to a specific hunting behavior, for example, its interest to the field would be greater.

      Nevertheless, this paper does contribute to growing evidence that specific behaviors can be triggered by specific neuronal populations within the brainstem.

    4. Author response:

      Reviewer #1 (Public Review): 

      The medial reticular formation (MRF) in the brainstem has long been implicated in the regulation of locomotion. One common - albeit very simple - model often presents the MRF as a major relay station receiving inputs from MLR circuits, among other brain regions, that together convey locomotor signals through efferent projections targeting the caudal brainstem and the spinal cord. Yet, the MRF is a particularly large brain area whose cellular complexity is far from understood. How molecularly distinct MRF ensembles contribute to the regulation of locomotor behaviors is largely unknown. Here, the authors apply focal activation of either glutamatergic, GABAergic, or serotonergic neurons throughout the MRF using a chemogenetic gain-of-function approach to uncover the putative modulatory properties of these neuronal ensembles during walking. Using kinematic analysis of mouse limbs during self-paced over-ground walkway locomotion, the authors find that activation of GABAergic MRF neurons can selectively slow down walking, whereas activation of glutamatergic neurons can induce a specific "shuffle" limb trajectory, altogether revealing that distinct MRF populations may retain the capability to engage divergent walking signatures, whose behavioral relevance is not yet clear. In contrast, the activation of serotonergic neurons did not affect walking signatures as described for the other two subgroups but led to an increase of locomotor speed. Interestingly, MRF neurons in each of the regional activation "hotspots" appear to target different domains in the lumbar spinal cord, suggesting that distinct circuit mechanisms are at play for the slowmo vs shuffle effects.

      Major points: 

      (1) While the experiments are carefully done and the results are well analyzed and clearly presented in a series of beautiful figures, several aspects of the methodology remain very confusing. 

      A) In particular, the initial choice for the injection coordinates is not justified and the authors don't leverage the mapping of spinal projection neurons to drive their chemogenetic screen. 

      Thank you for pointing this out. To clarify this, we now start the results with an extra paragraph and accompanying figures (Figure 2 and its supplementary figures) in which we define the region of interest (ROI) within the mRF. The ROI is based upon the distribution of reticulospinal neurons in the brainstem mRF that connect directly with the lumbosacral enlargement (whether or not this ROI projects to other CNS sites), which contains the main networks important for hindlimb control during locomotion, including walking gait. Reticulospinal neurons in the mRF in the caudal pons and medulla oblongata form longitudinal columns that together occupy up to more than half of the entire brainstem. While the morphology of the medulla and caudal pons varies little from level to level, in contrast to rapid changes at the midbrain level, this doesn’t necessarily mean that the neuronal populations, even within neurotransmitter classes, are homogeneous in connectivity and function. We have now clearly denoted the rostrocaudally extensive field with its dorsoventral and mediolateral dimensions that comprises the anatomical region of interest in the new figure. While this dataset is rather basic, it allows us to directly refer back to it and clarify additional queries that came up related to the anatomy (i.e. that the hotspots for slomo- and shuffle-like gaits only cover a small portion of the reticulospinal field).

      We then included detailed anatomical mapping of the spinal projections for the identified hotspots for changes in walking quality (phenomenology), the central theme of the study, and for immediately adjacent regions, to highlight contrasting location-connectivity-functional properties between these adjacent sites. To better incorporate these mapping results, we now present them directly following the walking-function-based transfection site mapping, but before delving into the details of the walking gait phenotypes. We did not systematically include mapping results from all sites in the mRF ROI in this manuscript, as this was beyond the scope of this already very large functional-anatomical study.

      B) Similarly, the authors group very different injection schemes (unilateral or bilateral targeting of MRF neurons), that should be analyzed separately. 

      We now clarify early in the results section how uni- and bilateral groups were composed and what the rationale was for this. As pilot data suggested that the slomo gait style was only seen following bilateral activation in VGaT-cre mice, but not in all bilateral cases, we designed the VGaT cohort to contain mainly bilateral injections, spread across the mRF region of interest, with a smaller group of unilateral injections to verify the pilot data. 

      For the shuffle gait style, pilot data suggested that both uni- and bilateral activation of VGluT2 neurons could elicit this style, but only in a subset of uni- and bilateral cases. Therefore, we mainly included unilateral injections in this group, with a smaller bilateral cohort for verification. This approach served the main goal of the study, which was to map the walking style changes to subregions in the mRF.

      However, laterality is indeed very important when it comes to locomotor control. The effects of laterality on the walking gait styles generated from the hotspots were included in supplemental figures and accompanying Tables. We have now better highlighted these in the body of the text and we have added analyses of the motor tests for uni- or bilateral groups. 

      Furthermore, it should be noted that the uni- and bilateral groups are heterogeneous when it comes to rostrocaudal and dorsoventral placement within the mRF ROI. As such, we were not able to rigorously compare uni- versus bilateral activation effects while at the same time separating cases out by dorsoventral and rostrocaudal location (which would be needed to do justice to the functional anatomical organization of the mRF), as we do not have sufficient power in each of the subgroups (i.e. 3 rostrocaudal levels, each with a dorsal, intermediate, and ventral region to target, each of which would have to be injected unilaterally and bilaterally). This was beyond the scope of this already very large study. Further studies designed to balance ipsi- and contralateral groups will be necessary to map out the hotspots for mobility phenotypes that may be driven by the mRF beyond the slomo- and shuffle-hotspots or to systematically study the impact of laterality on mobility from the mRF.

      To summarize, analyses of uni- vs bilateral stimulation demonstrate that bilateral inhibition within the slomo hotspot is necessary to create the slomo walking phenotype, and that unilateral inhibition within the shuffle hotspot is sufficient to create the shuffle walking phenotype (with bilateral stimulation not enhancing the phenotype further). Unilateral activation of the slomo hotspot did not induce asymmetries in gait or a reduction in motor performance, whereas unilateral activation of the shuffle hotspot induced an asymmetry in swing time but not stride length, with laterality affecting horizontal ladder but not other motor tests. Mice with transfection sites within the mRF region of interest but outside of the slomo and shuffle hotspots did not display these walking phenotypes but did display slowed walking without qualitative changes. The connectivity to spinal and other supraspinal substrates differed between these sites, providing clues for the substrates that mediate these differential functions.

      C) The choice of Z score cutoff that dictates the in-depth analysis of the chemogenetic phenotypes appears arbitrary and is not grounded in a set of objective criteria. 

      We are sorry that the Z score cutoff appeared arbitrary as that was not our intention. 

      The values to separate mice with and without a significant change were simply set at 2 standard deviations from the population mean in the control mice (i.e. Z=2). Two standard deviations from the population mean is widely used in all types of statistical analyses. We have now included the rationale for the cutoff of Z=2 in the text. Where group size allowed, to increase contrast between positive and negative groups in terms of gait characteristics, other behavioral assays and mapping, we used data from Z scores >3 (or < -3), but we can assure the reviewer that all moderately positive data (i.e. from mice with gait style Z scores between 2 and 3, and between -3 and -2) were reported as well in the statistical tables or supplementary figures. We have now included the links to these supplementary tables and figures in the text, rather than only in the figure legends.
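      The cutoff logic described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function names, the use of the sample standard deviation, and the handling of values falling exactly on a boundary are assumptions; only the thresholds of 2 and 3 standard deviations relative to control mice come from the response.

```python
from statistics import mean, stdev


def z_scores(values, control_values):
    """Z scores of `values` relative to the control-group mean and SD.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mu = mean(control_values)
    sigma = stdev(control_values)
    return [(v - mu) / sigma for v in values]


def classify(z, moderate=2.0, strong=3.0):
    """Label a Z score using the |Z| = 2 and |Z| = 3 thresholds.

    Assumption: values exactly at a threshold fall into the lower category
    for `strong` and the higher category for `moderate`.
    """
    magnitude = abs(z)
    if magnitude > strong:
        return "strong"
    if magnitude >= moderate:
        return "moderate"
    return "unchanged"
```

      For example, a gait metric from each treated mouse would be passed through `z_scores` together with the corresponding control measurements, and `classify` would then separate strongly positive, moderately positive, and unchanged animals in either direction.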

      The Z scores for the different gait styles indeed appear to map to discrete sites, but the Z score cutoff was not informed by these sites or by anatomical data. Similarly, Z scores for changes in tonic muscle activity elicited by activation of inhibitory neurons also mapped to a hotspot in the same rostrocaudal column as the slomo gait style, but further caudally. This further demonstrates the strength of function-based mapping. 

      (2) One issue that arises from the work presented here is that we don't know if these MRF neurons are active during locomotion in normal, unperturbed conditions. Knowing the recruitment profile of these MRF neurons would clarify whether the chemogenetic activation boosts the firing of neurons that are already active during walking, or activates neurons that are otherwise silent. Disentangling these possibilities may have a profound impact on the overall interpretation of the results.

      We agree that this knowledge would improve our ability to interpret and apply the findings of the current study. It is indeed important to learn when these mRF sites are being recruited, whether as part of normal modulatory strategies to navigate through a complex environment, as part of specialized behavioral modules, or both. Another question is how loss of function in these sites impacts behavior and function. This concept has been added to the discussion, and these questions can now be pursued in future experiments.

      (3) The results should be discussed in the broader context of historic stimulation experiments, notably in cats and other species, as well as more recent circuit mapping approaches in rodents. For instance, the notion that focal stimulation of distinct areas within the MRF can elicit or modify the pattern of locomotion is not really new, nor is the notion that some of these modulations are phase-specific and can influence the duration of single muscle activation during stance or swing phases. This last point has for instance already been assessed through individual muscle recordings paired with MRF stimulation in cats. Perhaps better introducing these key studies and thoroughly discussing what the results presented in this manuscript bring in terms of novelty will help readers ground this work in a more comprehensive and larger body of work.

      There is indeed a rich series of meticulous work done in cats, which included effects from stimulation of inhibitory and excitatory neurons on limb EMG, and rodent work focusing on excitatory mRF neurons. These studies show that distinct neurons or sites within the mRF drive distinct changes in motor readouts, albeit not described in terms of modulation of walking gait as we do here in terms of gait signatures. Despite this solid body of prior work, the notion of phase specificity and separate modulation of swing versus stance phase metrics has been underappreciated and therefore deserves to be emphasized. We have expanded the discussion to better highlight prior work and the interpretation of phase specificity has been enriched.  

      Reviewer #2 (Public Review): 

      This paper is an interesting conceptual work where certain hotspot areas were found to induce unique gait patterns. These patterns differed from a classic change in speed or gait pattern from a walk to a gallop. From this, a hypothesis was formed that these areas could be important for possible alternative walking patterns seen, for example, during pathologies such as Parkinson's disease or perhaps related to stalking behaviors. 

      While I liked the work and found it interesting, it remains descriptive in that the actual behaviors observed can't be causally related to a particular behavior such as stalking or shuffling. If the necessity or sufficiency of this region was related to a specific hunting behavior, for example, its interest to the field would be greater. 

      Nevertheless, this paper does contribute to growing evidence that specific behaviors can be triggered by specific neuronal populations within the brainstem. 

      We thank the reviewer for their thoughtful comments. We agree that more studies are necessary to understand how the slomo and shuffle hotspots serve behavioral repertoires (such as stalking or other internally driven activities) and adaptations (such as object avoidance or more subtle adjustments to terrain or internal cues). The experimental details of the present study leave ample leads for the research community to pursue these new directions.

    1. eLife Assessment

      This study provides valuable mechanistic insight into the mutually exclusive distributions of the histone variant H2A.Z and DNA methylation by testing two hypotheses: (i) that DNA methylation suppresses H2A.Z deposition by ATP-dependent chromatin remodelling complexes, and (ii) that DNA methylation destabilizes H2A.Z nucleosomes, thereby preventing H2A.Z retention. Through a series of well-designed and carefully executed experiments, solid support is presented for the first hypothesis. The evidence supporting the second hypothesis is less complete, and the extent to which either mechanism is responsible for H2A.Z exclusion from methylated DNA remains not entirely clear. This work will be of broad interest to researchers in chromatin biology and epigenetics.

    2. Reviewer #1 (Public review):

      Summary:

      The authors considered the mechanism underlying previous observations that H2A.Z is preferentially excluded from methylated DNA regions. They considered two non-mutually exclusive mechanisms. First, they tested the hypothesis that nucleosomes containing both methylated DNA and H2A.Z might be intrinsically unstable due to their structural features. Second, they explored the possibility that DNA methylation might impede SRCAP-C from efficiently depositing H2A.Z onto these DNA methylated regions.

      Their structural analyses revealed subtle differences between H2A.Z-containing nucleosomes assembled on methylated versus unmethylated DNA. To test the second hypothesis, the authors allowed H2A.Z assembly on sperm chromatin in Xenopus egg extracts and mapped both H2A.Z localization and DNA methylation in this transcriptionally inactive system. They compared these data with corresponding maps from a transcriptionally active Xenopus fibroblast cell line. This comparison confirmed the preferential deposition or enrichment of H2A.Z on unmethylated DNA regions, an effect that was much more pronounced in the fibroblast genome than in sperm chromatin. Furthermore, nucleosome assembly on methylated versus unmethylated DNA, along with SRCAP-C depletion from Xenopus egg extracts, provided a means to test whether SRCAP-C contributes to the preferential loading of H2A.Z onto unmethylated DNA.

      Strengths:

      The strength and originality of this work lie in its focused attempt to dissect the unexplained observation that H2A.Z is excluded from methylated genomic regions.

      Weaknesses:

      The study has two weaknesses. First, although the authors identify specific structural effects of DNA methylation on H2A.Z-containing nucleosomes, they do not provide evidence demonstrating that these structural differences lead to altered histone dynamics or nucleosome instability. Second, building on the elegant work of Berta and colleagues (cited in the manuscript), the authors implicate SRCAP-C in the selective deposition of H2A.Z at unmethylated regions. Yet the role of SRCAP-C appears only partial, and the study does not address how the structural or molecular consequences of DNA methylation prevent efficient H2A.Z deposition. Finally, additional plausible mechanisms beyond the two scenarios the authors considered are not investigated or discussed in the manuscript.

      Comments on revisions:

      The authors have addressed all previously raised concerns and propose a revised version of the manuscript. Notably, the abstract and discussion sections have been improved, and new experimental data have been incorporated. Collectively, these revisions enhance the rigor and clarity of the data interpretation and discussion.

      Given these improvements, this reviewer believes that the manuscript could be published, particularly if this publication is accompanied by the critical points discussed in the rebuttal letter.

    3. Reviewer #2 (Public review):

      This manuscript aims to elucidate the mechanistic basis for the long-standing observation that DNA methylation and the histone variant H2A.Z occupy mutually exclusive genomic regions. The authors test two hypotheses: (i) that DNA methylation intrinsically destabilizes H2A.Z nucleosomes, thereby preventing H2A.Z retention, and (ii) that DNA methylation suppresses H2A.Z deposition by ATP-dependent chromatin-remodelling complexes. The revised manuscript addresses a number of previous concerns, and the manuscript has therefore improved accordingly. However, several limitations remain.

      Comments on revisions:

      The authors have addressed a number of my previous concerns, and the manuscript has improved accordingly. However, several limitations remain that, in my view, constrain the strength of the conclusions. In particular, a direct comparison with a canonical nucleosome assembled on the same DNA template is absent. This control is essential to determine whether the observed effects are specific to H2A.Z or reflect more general properties of methylated DNA-nucleosome interactions. Notably, even within the authors' own data, there is a trend suggesting that methylated canonical H2A nucleosomes may also exhibit increased accessibility. Although this does not reach statistical significance, the authors themselves argue that subtle differences can be biologically meaningful; it is therefore plausible that extended digestion conditions (e.g., longer HinfI exposure) could reveal a significant effect. Unless a direct structural comparison with a canonical nucleosome is performed, the possibility that the reported phenomenon is not specific to H2A.Z remains. This is compounded by the reliance on a single restriction enzyme-based assay, which represents a limited experimental approach. Such an approach is insufficient to unequivocally support the central claim that DNA methylation increases accessibility of H2A.Z-containing nucleosomes. Additional orthogonal assays would be required to substantiate this conclusion. With respect to the cryo-EM analysis of methylated and unmethylated 601L H2A.Z nucleosomes, and in general, the authors still do not adequately consider the positional context of CpG methylation. Extensive literature demonstrates that the effects of DNA methylation on canonical nucleosome structure and stability are highly position-dependent. Without accounting for the location of methylated CpGs relative to key DNA-histone contact sites, the structural data remain difficult to interpret mechanistically.

      Overall, while the manuscript has improved, it remains a relatively limited study that draws broad mechanistic conclusions from minimal experimental data.

    4. Reviewer #3 (Public review):

      Summary:

      Histone variant H2A.Z is evolutionarily conserved among various species. The selective incorporation and removal of histone variants on the genome play crucial roles in regulating nuclear events, including transcription. Shih et al. aimed to address antagonistic mechanisms between histone variant H2A.Z deposition and DNA methylation. To this end, the authors reconstituted H2A.Z nucleosomes in vitro using a methylated or unmethylated human satellite II DNA sequence and examined how DNA methylation affects H2A.Z nucleosome structure and dynamics. The cryo-EM analysis revealed that DNA methylation induces a more open conformation in H2A.Z nucleosomes. Consistent with this, their biochemical assays showed that DNA methylation subtly increases restriction enzyme accessibility in H2A.Z nucleosomes compared with canonical H2A nucleosomes. The authors identified genome-wide profiles of H2A.Z and DNA methylation using genomic assays and found that their distributions differ between Xenopus sperm pronuclei and fibroblast cells. Using Xenopus egg extract systems, the authors showed that the SRCAP complex, the chromatin remodeler for H2A.Z deposition, preferentially binds to unmethylated DNA to deposit H2A.Z.

      Strengths:

      The experiments are rigorously performed, and interpretations are clear. The study presents a high-resolution cryo-EM structure of the human H2A.Z nucleosome with methylated DNA. Although the effect of DNA methylation on the physical stability of the H2A.Z nucleosome is subtle, this would be an important finding that warrants further functional investigation. The discovery that the SRCAP complex senses DNA methylation is novel and provides important mechanistic insight into the antagonism between H2A.Z and DNA methylation.

      Weaknesses:

      The authors have satisfactorily addressed my concerns.

    5. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1.

      We appreciate the constructive comments, which greatly improved this manuscript.

      Reviewer #2.

      We appreciate Reviewer #2's thorough analysis of our manuscript. However, we are concerned that the reviewer criticized a conclusion different from the one we claim in the manuscript. Although Reviewer #2's public comment stated, "Such an approach is insufficient to unequivocally support the central claim that DNA methylation increases accessibility of H2A.Z-containing nucleosomes", we did not draw such a bold conclusion. In the Abstract, we cautiously described that the impact of DNA methylation we observed was subtle and based on satellite II-derived DNA sequences. We made a nuanced proposal regarding this observation, stating, "Altogether, we propose that SRCAP drives the biased association of H2A.Z to unmethylated DNA, while additional mechanisms, potentially taking advantage of the subtle DNA methylation-induced physical effects, further assist the exclusion of H2A.Z from methylated DNA". We believe our analysis will contribute valuable insights into the mechanistic basis behind the antagonism between DNA methylation and H2A.Z.

      Reviewer #3.

      We appreciate the constructive comments, which greatly improved this manuscript.


      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study provides valuable mechanistic insight into the mutually exclusive distributions of the histone variant H2A.Z and DNA methylation by testing two hypotheses: (i) that DNA methylation destabilizes H2A.Z nucleosomes, thereby preventing H2A.Z retention, and (ii) that DNA methylation suppresses H2A.Z deposition by ATP-dependent chromatin remodeling complexes. Through a series of well-designed and carefully executed experiments, findings are presented in support of both hypotheses. However, the evidence in support of either hypothesis is incomplete, so that the proposed mechanisms underlying the enrichment of H2A.Z on unmethylated DNA remain somewhat speculative.

      We would like to thank the editor and reviewers for their critical assessments of our manuscript. While we do acknowledge the limitations of our work, we believe that our results provide important mechanistic insights into the long-standing question of how H2A.Z is preferentially enriched in hypomethylated genomic DNA regions. First, our structural and biochemical data suggest that DNA methylation increases the openness and physical accessibility of H2A.Z nucleosomes, although the effect is relatively subtle and sequence-dependent. Second, using Xenopus egg extracts and synthetic DNA templates, we provide the first clear and direct evidence that DNA methylation-sensitive H2A.Z deposition is due to the H2A.Z chaperone SRCAP-C, corroborated by our discovery that SRCAP-C binding to DNA is suppressed by DNA methylation. Although the molecular details by which DNA methylation inhibits binding of SRCAP-C are an important area of future study, in our current manuscript, we do provide evidence that directly links the presence of SRCAP-C to the establishment of the DNA methylation/H2A.Z antagonism in a physiological system. Thanks to criticisms by the reviewers, we realized that we did not clearly state in our Abstract that the impact of DNA methylation on intrinsic H2A.Z nucleosome stability is relatively subtle, although we did explain these observations and limitations in the main text. In our revised manuscript, we are willing to edit the text to better address the criticisms raised by the reviewers.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors considered the mechanism underlying previous observations that H2A.Z is preferentially excluded from methylated DNA regions. They considered two non-mutually exclusive mechanisms. First, they tested the hypothesis that nucleosomes containing both methylated DNA and H2A.Z might be intrinsically unstable due to their structural features. Second, they explored the possibility that DNA methylation might impede SRCAP-C from efficiently depositing H2A.Z onto these DNA methylated regions.

      Their structural analyses revealed subtle differences between H2A.Z-containing nucleosomes assembled on methylated versus unmethylated DNA. To test the second hypothesis, the authors allowed H2A.Z assembly on sperm chromatin in Xenopus egg extracts and mapped both H2A.Z localization and DNA methylation in this transcriptionally inactive system. They compared these data with corresponding maps from a transcriptionally active Xenopus fibroblast cell line. This comparison confirmed the preferential deposition or enrichment of H2A.Z on unmethylated DNA regions, an effect that was much more pronounced in the fibroblast genome than in sperm chromatin. Furthermore, nucleosome assembly on methylated versus unmethylated DNA, along with SRCAP-C depletion from Xenopus egg extracts, provided a means to test whether SRCAP-C contributes to the preferential loading of H2A.Z onto unmethylated DNA.

      Strengths:

      The strength and originality of this work lie in its focused attempt to dissect the unexplained observation that H2A.Z is excluded from methylated genomic regions.

      Weaknesses:

      The study has two weaknesses. First, although the authors identify specific structural effects of DNA methylation on H2A.Z-containing nucleosomes, they do not provide evidence demonstrating that these structural differences lead to altered histone dynamics or nucleosome instability. Second, building on the elegant work of Berta and colleagues (cited in the manuscript), the authors implicate SRCAP-C in the selective deposition of H2A.Z at unmethylated regions. Yet the role of SRCAP-C appears only partial, and the study does not address how the structural or molecular consequences of DNA methylation prevent efficient H2A.Z deposition. Finally, additional plausible mechanisms beyond the two scenarios the authors considered are not investigated or discussed in the manuscript.

      Although we acknowledge the limitations of our study and are willing to expand our discussion to more thoroughly discuss these points, we believe our manuscript provides several important mechanistic insights which this reviewer may not have fully appreciated.

      Our first conclusion that H2A.Z nucleosomes on methylated DNA are more open and accessible compared to their unmethylated counterparts is supported by both our cryo-EM study and the restriction enzyme accessibility assay. Although the physical effect of DNA methylation is relatively subtle and is likely sequence dependent, as we clearly noted within the manuscript, the difference does exist and is valuable information for the chromatin field at large to consider.

      The second major conclusion of our manuscript is that SRCAP-C exhibits preferential binding to unmethylated DNA over methylated DNA, and that SRCAP-C represents the major mechanism that can explain the biased deposition of H2A.Z to unmethylated DNA in Xenopus egg extracts. Furthermore, our experiments using Xenopus egg extract clearly demonstrated that H2A.Z is deposited by both DNA methylation-sensitive and -insensitive mechanisms. Depletion of SRCAP-C almost completely eliminated DNA methylation-sensitive H2A.Z deposition and reduced the total level of H2A.Z on chromatin to less than half of that seen in non-depleted extract. This result demonstrated that DNA methylation-sensitive H2A.Z loading is primarily regulated by SRCAP-C, at least in our experimental context where transcription, replication, and other epigenetic modifications are not involved. It is likely that additional mechanisms do further contribute, as implicated by our sequencing experiments, particularly at regions with active transcription, and we have noted these possibilities and the rationale for their existence in the Discussion.

      Our study also suggests that a SRCAP-independent, DNA methylation-insensitive mechanism of H2A.Z loading exists, which we suspect to be mediated by Tip60-C. In line with this possibility, our data suggest that Tip60-C binds DNA in a DNA methylation-insensitive manner in Xenopus egg extract. Since antibodies to deplete Tip60-C from Xenopus egg extract are currently unavailable, we were unable to directly test that hypothesis and decided not to include Tip60-C into our final model as we lacked experimental evidence for its role. However, whether or not Tip60-C is the complex responsible for the DNA methylation-insensitive pathway does not influence our final conclusion that SRCAP-C plays a major role in DNA methylation-sensitive H2A.Z loading. We are planning to edit our manuscript to more comprehensively discuss these points.

      Please note that while Berta et al. reported that DNA methylation increases at H2A.Z loci in tumors defective in SRCAP-C, they selected those regions based on where H2A.Z is typically enriched within normal tissues (Berta et al., 2021). They did not show data indicating whether H2A.Z is still retained specifically at those analyzed loci upon mutation of SRCAP-C subunits. Thus, although we greatly admire their work and are pleased that many of our findings align with theirs, their paper did not directly address whether SRCAP-C itself differentiates between DNA methylation states, nor the impact that this has on H2A.Z and DNA methylation colocalization. In contrast, our Xenopus egg extract system, where de novo methylation is undetectable (Nishiyama et al., 2013; Wassing et al., 2024), offers a unique opportunity to examine the direct impact of DNA methylation on H2A.Z deposition using controlled synthetic DNA substrates. Together with our demonstration that DNA binding of SRCAP-C is suppressed by DNA methylation, we believe that our manuscript provides a specific mechanism that can explain the preferential deposition of H2A.Z at hypomethylated genomic regions.

      Reviewer #2 (Public review):

      This manuscript aims to elucidate the mechanistic basis for the long-standing observation that DNA methylation and the histone variant H2A.Z occupy mutually exclusive genomic regions. The authors test two hypotheses: (i) that DNA methylation intrinsically destabilizes H2A.Z nucleosomes, thereby preventing H2A.Z retention, and (ii) that DNA methylation suppresses H2A.Z deposition by ATP-dependent chromatin-remodelling complexes. However, neither hypothesis is rigorously addressed. There are experimental caveats, issues with data interpretation, and conclusions that are not supported by the data. Substantial revision and additional experiments, including controls, would be required before mechanistic conclusions can be drawn. Major concerns are as follows:

      We appreciate the critical assessment of our manuscript by this reviewer. Although we acknowledge the limitations of our study and will revise the manuscript to better describe them, we would like to respectfully argue against the statement that our "conclusions […] are not supported by the data".

      (1) The cryo-EM structure of methylated H2A.Z nucleosomes is insufficiently resolved to address the central mechanistic question: where the methylated CpGs are located relative to DNA-histone contact points and how these modifications influence H2A.Z nucleosome structure. The structure provides no mechanistic insights into methylation-induced destabilization.

      The fact that the DNA resolution in the methylated structure was not high enough to resolve the positions of methylated CpGs despite a high overall resolution of 2.78 Å implies that 1) the Sat2R-P DNA was not as stably registered as the 601L sequence, requiring us to create two alternative Sat2R-P atomic models to account for the variable positioning in our samples, and 2) that the presence of DNA methylation increases that positional variability. We understand that one may prefer to see highly resolved density around each methylation mark, but we do believe that our inability to accomplish that is actually a feature rather than a weakness and has important biological implications. The decrease in local DNA resolution on the methylated Sat2R-P structure compared to its unmethylated counterpart is meaningful and suggests to us that DNA methylation weakens overall DNA wrapping and positioning on the nucleosome, supported by the increased flexibility seen at the linker DNA ends as well as an increase in the population of highly shifted nucleosomes amongst the methylated particles. Additionally, one major view in the DNA methylation/nucleosome stability field is that the presence of DNA methylation can make DNA stiffer and harder to bend, causing opening and destabilization of nucleosomes (Ngo et al., 2016). The increased opening of linker DNA ends and accessibility of methylated H2A.Z nucleosomes in our hands also aligns with such an idea, again suggesting decreased histone-DNA contact stability on methylated DNA substrates. We plan to revise the writing in our manuscript to better reflect these ideas.

      The experimental system also lacks physiological relevance. The template DNA sequence is artificial, despite the existence of well-characterised native genomic sequences for which DNA methylation is known to inhibit H2A.Z incorporation. Alternatively, there are a number of studies examining the effect of DNA methylation on nucleosome structure, stability, DNA unwrapping, and positioning. Choosing one of these DNA sequences would have at least allowed a direct comparison with a canonical nucleosome. Indeed, a major omission is the absence of a cryo-EM structure of a canonical nucleosome assembled on the same DNA template - this is essential to assess whether the observed effects are H2A.Z-specific.

      The reviewer raises a fair question about whether canonical H2A would experience the same DNA methylation-dependent structural effects. We had considered solving the H2A structures but ultimately decided against it for a few reasons. First, there already exist crystal structures of canonical H2A nucleosomes using a DNA sequence highly similar to our Sat2R-P, with and without DNA methylation (PDB: 5CPI and 5CPJ). The authors of this study did not see any physical differences in their structures (Osakabe et al., 2015). Additionally, we had included canonical H2A conditions within our restriction enzyme accessibility assay and did not see a significant impact of DNA methylation on those samples (Fig 3). Because of the previous report and our own negative data, we expected that only limited additional insights would be obtained from the canonical H2A structures and decided not to pursue that analysis.

      One of the primary reasons we chose the Sat2R-P sequence was, as noted above, that there already was a published study examining how DNA methylation affects nucleosome structure using a variant of this sequence which we could compare to our results, as the reviewer has suggested. We did have to modify the sequence, namely by making it palindromic, in order to increase the final achievable resolution. We viewed the Sat2R-P sequence as an attractive candidate because it is physiologically relevant; the initial sequence was taken directly from human satellite II. Several modifications were made for technical reasons, including making the sequence palindromic as described above and also ensuring that each CpG is recognizable by a methylation-sensitive restriction enzyme so that we could be certain about the degree of methylation on our substrates. These practical concerns outweighed the necessity of maintaining a strict physiological sequence to us. However, we still believe the final Sat2R-P more closely mimics physiological sequences than Widom 601. Additionally, human satellite II is a highly abundant sequence in the human genome that is known to undergo large methylation changes on the onset of many disorders, like cancer, as well as during aging. Thus, there are interesting biological questions surrounding how the methylation state of this particular sequence affects chromatin structure.

      Furthermore, it has been reported that satellite II is devoid of H2A.Z (Capurso et al., 2012). Beyond those reasons, the satellite II sequence is generally interesting to our lab because we have been studying genes involved in ICF syndrome, where hypomethylation of satellite II sequences forms one of the hallmarks of this disorder (Funabiki et al., 2023; Jenness et al., 2018; Wassing et al., 2024). We understand that sequence context plays a large role in nucleosome wrapping and stability. This is why we strived to test multiple sequences in each of our assays. We do agree that it would be interesting to use DNA sequences where H2A.Z binding has already been described to be affected in a DNA methylation-dependent manner, forming an exciting future study to pursue.

      Furthermore, the DNA template is methylated at numerous random CpG sites. The authors' argument that only the global methylation level is relevant is inconsistent with the literature, which clearly demonstrates that methylation effects on canonical nucleosomes are position-dependent. Not all CpG sites contribute equally to nucleosome stability or unwrapping, and this critical factor is not considered.

      We did not argue that only the global methylation level is relevant. We also would appreciate it if the reviewer could provide specific references that "clearly demonstrates that methylation effects on canonical nucleosomes are position-dependent". We are aware of a series of studies conducted by Chongli Yuan's group, including one testing the effect of placing methylated CpGs at different positions along the Widom 601 sequence. In that study (Jimenez-Useche et al., 2013), they did find that positioning of mCpGs has differential impacts on the salt resistance of the nucleosomes, with 5 tandem mCpG copies at the dyad causing the most dramatic nucleosome opening whereas having mCpGs only at the DNA major grooves, but not elsewhere, increased nucleosome stability. However, they did also find that methylation of the original Widom 601 sequence also caused destabilization, albeit to a lesser degree, and another study by the same group (Jimenez-Useche et al., 2014) also found that CpG methylation decreased nucleosome-forming ability for all tested variants of the Widom 601 sequence, regardless of CpG density or positioning.

      Other studies monitored how distribution of methylated CpGs correlates with nucleosome positioning (Collings et al., 2013; Davey et al., 1997; Davey et al., 2004). However, these studies assessed the sequence-dependent effects specifically on nucleosome assembly during in vitro salt dialysis, which is a different physical process than the one our manuscript focuses on, especially when considering the fact that H2A.Z is deposited onto preassembled H2A-nucleosome. Our cryo-EM analysis examines the structural changes induced by DNA methylation on already formed nucleosomes rather than the process of formation. Thus, probing accessibility changes using a restriction enzyme was the more appropriate biochemical assay to verify our structures.

      We do very much agree that DNA context can influence nucleosome stability under different conditions. A study of molecular dynamics simulations concluded that the "combination of overall DNA geometrical and shape properties upon methylation" makes nucleosomes resistant to unwrapping (Li et al., 2022), while another modeling study suggests that DNA methylation impacts nucleosome stability in a manner dependent on DNA sequence, where "[s]trong binding is weakened and weak binding is strengthened" (Minary and Levitt, 2014). While G/C-dinucleotides are preferentially placed at major groove-inward positions in the nucleosomes in vivo (Chodavarapu et al., 2010; Segal et al., 2006) and G/C-rich segments are excluded from major groove-outward positions in Widom 601-like nucleosomes (Chua et al., 2012), methylated CpG dinucleotides are preferably, if not exclusively, located at major groove-outward positions in vivo. Mechanisms behind this biased mCpG positioning on the nucleosome remain speculative, likely caused by a combination of multiple factors, but the fact that we did not observe clear structural impacts using the Widom 601L sequence, where mCpGs are located at the major groove-outward and -inward positions (Chua et al., 2012, and our structure), deserves a space for discussion. On the other hand, positioning of mCpG on satellite II-derived sequences that we used in this study was based on a physiological sequence, and thus it may not be appropriate to say that those CpGs are placed at multiple "random" positions. Although we decided not to discuss the position of 5mC on our Sat2R nucleosome structure due to ambiguous base assignments, neither of our two atomic models is consistent with an idea that DNA methylation repositions the CpG to the outward major grooves. As the potential effect of DNA methylation on nucleosome structure via modulation of DNA stiffness has been extensively studied (Choy et al., 2010; Li et al., 2022; Ngo et al., 2016; Perez et al., 2012), we believe that it is appropriate to consider overall DNA properties along the whole DNA sequence, though we are willing to discuss potential positional effects in the revised manuscript.

      Perhaps one of the most important points that we did not emphasize enough in our original manuscript was that, in contrast to the subtle and DNA sequence-dependent intrinsic effect of DNA methylation, we observed SRCAP-dependent preferential H2A.Z deposition to unmethylated DNA over methylated DNA on both 601 and satellite II DNAs. In the revised manuscript, we will make clearer the value of comparing 601 and satellite II DNAs across these two distinct mechanisms.

      Finally, and most importantly, the reported increase in accessibility of the methylated H2A.Z nucleosome is negligible compared with the much larger intrinsic DNA accessibility of the unmethylated H2A.Z nucleosome. These data do not support the authors' hypothesis and contradict the manuscript's conclusions. Claims that methylated H2A.Z nucleosomes are "more open and accessible" must therefore be removed, and the title is misleading, given that no meaningful impact of DNA methylation on H2A.Z nucleosome stability is demonstrated.

      We respectfully disagree with this reviewer's criticism. We investigated the potential impact of DNA methylation on nucleosome stability to the best of our abilities through complementary assays and reported our observations. The effect of DNA methylation is smaller than the difference between H2A.Z and H2A, but we were able to see an effect. It is also not uncommon for small differences to have functional impacts in biological systems. We agree that further testing is required to determine whether this subtle effect is functionally important, and it remains the subject of future research due to the many technical challenges associated with addressing said question. We would like to note that 18 years have passed since Daniel Zilberman first reported the antagonistic relationship between H2A.Z and DNA methylation (Zilberman et al., 2008), but very few studies have since directly tested specific mechanistic hypotheses. We believe that our study lays the groundwork for exciting future investigation that better elucidates the pathways that contribute to this antagonism and will have meaningful impacts on the field in general. However, thanks to the reviewer's criticism, we realized that we did not clearly state in the Abstract the relatively subtle effect of DNA methylation on the intrinsic H2A.Z nucleosome stability. Therefore, we will revise the Abstract accordingly to make this point clearer.

      (2) The cryo-EM structures of methylated and unmethylated 601L H2A.Z nucleosomes show no detectable differences. As presented, this negative result adds little value. If anything, it reinforces the point that the positional context of CpG methylation is critical, which the manuscript does not consider.

      We believe the inclusion and factual reporting of negative data is important for the scientific community as one of the major issues currently in biology research is biased omission of negative data. We considered eLife as a venue to publish this work for this reason. We understand that the reviewer believes our 601L structures may detract from the overall message of our manuscript. We believe this data rather emphasizes the importance of DNA sequence context, something that the reviewer also rightfully notes. It is standard practice in the nucleosome field to use the Widom 601 sequence, along with its variants. Our experience has shown that use of an artificially strong positioning sequence may mask weaker physical effects that could play a physiological role. Thus, we were careful to validate all further assays with multiple DNA sequences and believed it important to report these sequence-dependent effects on nucleosome structure.

      (3) Very little H3 signal coincides with H2A.Z at TSSs in sperm pronuclei, yet this is neither explained nor discussed (Supplementary Figure 10D). The authors need to clarify this.

      Our H3 signal, which represents the global nucleosome population, is more broadly distributed across the genome than H2A.Z, which is known to localize at specific genomic sites. Since both histone types were sequenced to similar read depths, H3 peaks are generally shallower than H2A.Z and peak heights cannot be directly compared (i.e. they should be represented in separate appropriate data ranges).

      (4) In my view, the most conceptually important finding is that H2A.Z-associated reads in sperm pronuclei show ~43% CpG methylation. This directly contradicts the model of strict mutual exclusivity and suggests that the antagonism is context-dependent. Similarly, the finding that the depletion of SRCAP reduces H2A.Z deposition only on unmethylated templates is also very intriguing. Collectively, these results warrant further investigation (see below).

      (5) Given that H2A.Z is located at diverse genomic elements (e.g., enhancers, repressed gene bodies, promoters), the manuscript requires a more rigorous genomic annotation comparing H2A.Z occupancy in sperm pronuclei versus XTC-2 cells. The authors should stratify H2A.Z-DNA methylation relationships across promoters, 5′UTRs, exons, gene bodies, enhancers, etc., as described in Supplementary Figure 10A.

      We agree that the substantial presence of co-localized H2A.Z and DNA methylation specifically in the sperm pronuclei samples and the changes in pattern between nuclear types are highly interesting and require further investigation. However, we faced technical challenges in our sequencing experiments that made us refrain from conducting a more detailed analysis for fear of over-interpreting potential artifacts. These challenges mainly stemmed from the difficulties in collecting enough material from Xenopus egg extracts and Tn5’s innate bias towards accessible regions of the genome. Because of this, open regions of the genome tend to be overrepresented in our data (as noted in our Discussion), making it challenging to rigorously compare methylation profiles and H2A.Z/H3 associated genomic elements.

      While the degree of separation seems to be dependent on nuclei type, we still believe the antagonism exists in both the sperm pronuclei and XTC-2 samples when comparing H2A.Z methylation profiles to the corresponding H3 condition. Our study also demonstrates that H2A.Z is preferentially deposited to hypomethylated DNA in a manner dependent on SRCAP-C (the loss of SRCAP only reduces H2A.Z on unmethylated substrates) but that an additional methylation-insensitive H2A.Z deposition mechanism also exists. We realized that this interesting point was not clearly highlighted in the Abstract, so we will revise it accordingly.

      (6) Although H2A.Z accumulates less efficiently on exogenous methylated substrates in egg extract, substantial deposition still occurs (~50%). This observation directly challenges the strong antagonistic model described in the manuscript, yet the authors do not acknowledge or discuss it. Moreover, differences between unmethylated and methylated 601 DNA raise further questions about the biological relevance of the cryo-EM 601 structures.

      As depicted in Figure 6 and described in the Discussion, we clearly indicated that both methylation-sensitive and methylation-insensitive pathways exist to deposit H2A.Z within the genome. We also directly stated in our Discussion that a substantial proportion of H2A.Z colocalizes with DNA methylation both in our study as well as in previous reports, which is of major interest for future study. Additionally, we further discussed how the absence of transcription in Xenopus eggs is a likely reason for the more limited effect of DNA methylation restricting H2A.Z deposition in our egg extract system.

      As noted in our response to (2), the lack of a clear impact on our 601L structures is likely due to the extraordinarily strong artificial nucleosome positioning capacity of the 601 sequence and its variants. Since 601 is heavily used in chromatin biology, including within DNA methylation research, such negative data are still useful to include and publish.

      (7) The SRCAP depletion is insufficiently validated i.e., the antibody-mediated depletion of SRCAP lacks quantitative verification. A minimum of three biological replicates with quantification is required to substantiate the claims.

      We are willing to address this concern. However, please note that our data showed that methylation-dependent H2A.Z deposition is almost completely erased upon SRCAP depletion, indicating functionally effective depletion. The specificity of the custom antibody against Xenopus SRCAP was verified by mass spectrometry. Additionally, we have obtained the same effect using another commercially available SRCAP antibody, though we did not include this preliminary result in our original manuscript. Due to its relatively low abundance and high molecular weight, SRCAP western blot signals are weak, making it challenging to quantify the degree of depletion. We also believe that the value of quantification in this context, with the points noted above, is rather limited. In the past, our lab has published papers on depleting the H3T3 kinase Haspin from Xenopus egg extracts (Ghenoiu et al., 2013; Kelly et al., 2010) but were never able to detect Haspin via western blot. This protein was only detected by mass spectrometry specifically on nucleosome array beads with H3K9me3 (Jenness et al., 2018). However, depletion of Haspin was readily monitored by erasure of H3T3ph, the enzymatic product of Haspin. In these experiments, it was impossible, and not critical, to quantitatively monitor the depletion of Haspin protein in order to investigate its molecular functions. Similarly, in this current study, the important fact is that depletion of SRCAP suppressed methylation-sensitive H2A.Z deposition and quantifying the degree of SRCAP depletion would not have a major impact on this conclusion.

      (8) It appears that the role of p400-Tip60 has been completely overlooked. This complex is the second major H2A.Z deposition complex. Because p400 exhibits DNA methylation-insensitive binding (Supplementary Figure 14), it may account for the deposition of H2A.Z onto methylated DNA. This possibility is highly significant and must be addressed by repeating the key experiments in Figure 5 following p400-Tip60 depletion.

      We are aware that the Tip60 complex is a very likely candidate for mediating DNA methylation-insensitive H2A.Z deposition, which is why we tested whether DNA binding of p400 is methylation-sensitive. Therefore, the reviewer's statement that we "completely overlooked" Tip60-C’s role does not fairly report on our efforts. We wished to test the potential contribution of Tip60-C, but, unfortunately, the antibodies we currently have available to us were not successful in depleting the complex from egg extract. Since we had no direct experimental evidence indicating the role Tip60-C plays, we decided to take a conservative approach to our model and leave the methylation-insensitive pathway as mediated by something still unidentified. While further investigating Tip60-C’s contribution to this pathway is of definite value, we do not believe that it impacts our major conclusion that SRCAP-C is the main mediator responsible for H2A.Z deposition on unmethylated DNA, and it thus remains a subject for future study.

      (9) The manuscript repeatedly states that H2A.Z nucleosomes are intrinsically unstable; however, this is an oversimplification. Although some DNA unwrapping is observed, multiple studies show that H3/H4 tetramer-H2A.Z/H2B interactions are more stable (important recent studies include the following: DOI: 10.1038/s41594-021-00589-3; 10.1038/s41467-021-22688-x; and reviewed in 10.1038/s41576-024-00759-1).

      We understand that the H2A.Z stability field is highly controversial. We have introduced the many conflicting reports that have been published in the field but can further expand on the controversies if desired. We also understand that the term “nucleosome stability” is broad and encompasses many physical aspects. As noted in a prior response, we will better specify our use of the term within the manuscript. In our assays, we are most focused on the DNA wrapping stability of the nucleosome and have consistently seen in our hands that H2A.Z nucleosomes are much more open and accessible compared to canonical H2A nucleosomes on satellite II-derived sequences, regardless of methylation status. However, we do understand that many groups have observed the opposite, while others have obtained results similar to ours. We reported our findings on general H2A.Z stability in the hope of helping to clarify some of the field’s controversies.

      In summary, the current manuscript does not present a convincing mechanistic explanation for the antagonism between DNA methylation and H2A.Z. The observation that H2A.Z can substantially coexist with DNA methylation in sperm pronuclei, perhaps, should be the conceptual focus.

      We appreciate this reviewer’s advice. However, please note that the first author who led this project has already successfully defended their PhD thesis primarily based on this project, making it impractical and unrealistic to completely change the focus of this manuscript to include an entirely new avenue of research. We believe that our data provide important insights into the mechanisms by which H2A.Z is excluded from methylated DNA, particularly via the DNA methylation-sensitive binding of SRCAP-C, which has never been described before. We agree that many questions are still left unanswered, including the exact molecular mechanism behind how DNA methylation prevents SRCAP-C binding. We have preliminary data that suggest none of the known DNA-binding modules of SRCAP-C, including ZNHIT1, by themselves can explain this sensitivity. This implies that domain dissection in the context of the holo-SRCAP complex is required to fully address this question. We believe this represents a very exciting future avenue of study; however, it does not negate our finding that SRCAP-C itself is important for maintaining the DNA methylation/H2A.Z antagonism. Therefore, we respectfully disagree with this reviewer's summary statement, which misleadingly undermines the impact of our work.

      Reviewer #3 (Public review):

      Summary:

      Histone variant H2A.Z is evolutionarily conserved among various species. The selective incorporation and removal of histone variants on the genome play crucial roles in regulating nuclear events, including transcription. Shih et al. aimed to address antagonistic mechanisms between histone variant H2A.Z deposition and DNA methylation. To this end, the authors reconstituted H2A.Z nucleosomes in vitro using methylated or unmethylated human satellite II DNA sequence and examined how DNA methylation affects H2A.Z nucleosome structure and dynamics. The cryo-EM analysis revealed that DNA methylation induces a more open conformation in H2A.Z nucleosomes. Consistent with this, their biochemical assays showed that DNA methylation subtly increases restriction enzyme accessibility in H2A.Z nucleosomes compared with canonical H2A nucleosomes. The authors identified genome-wide profiles of H2A.Z and DNA methylation using genomic assays and found that their distributions differ between Xenopus sperm pronuclei and fibroblast cells. Using Xenopus egg extract systems, the authors showed that the SRCAP complex, the chromatin remodeler for H2A.Z deposition, preferentially deposits H2A.Z on unmethylated DNA.

      Strengths:

      The study is solid, and most conclusions are well-supported. The experiments are rigorously performed, and interpretations are clear. The study presents a high-resolution cryo-EM structure of human H2A.Z nucleosome with methylated DNA. The discovery that the SRCAP complex senses DNA methylation is novel and provides important mechanistic insight into the antagonism between H2A.Z and DNA methylation.

      We are grateful that this reviewer recognizes the importance of our study.

      Weaknesses:

      The study is already strong, and most conclusions are well supported. However, it can be further strengthened in several ways.

      (1) It is difficult to interpret how DNA methylation alters the orientation of the H4 tail and leads to the additional density on the acidic patch. The data do not convincingly support whether DNA methylation enhances interactions with H2A.Z mono-nucleosomes, nor whether this effect is specific to methylated H2A.Z nucleosomes.

      The altered H4 tail orientation and extra density seen on the acidic patch were incidental findings that we thought could be interesting for the field to be aware of but decided not to follow up on as there were other structural differences that were more directly related to our central question. We do believe that the above two differences are linked to each other because we used a highly purified and homogenous sample for cryo-EM analysis and the H4 tail/acidic patch interaction is a well characterized contact that mediates inter-nucleosome interactions. Additionally, other groups have reported that the presence of DNA methylation causes condensation of both chromatin and bare DNA (cited within our manuscript), though the mechanics behind this phenomenon remain to be elucidated. We believed that our structure data may also align with those findings. However, the reviewer is fair in pointing out that we do not provide further experimental evidence in verifying the existence of these increased interactions. We can revise our writing to clarify that these points are currently hypotheses rather than validated results.

      (2) It remains unclear whether DNA methylation alters global H2A.Z nucleosome stability or primarily affects local DNA end flexibility. Moreover, while the authors showed locus-specific accessibility by HinfI digestion, an unbiased assay such as MNase digestion would strengthen the conclusions.

      We would like to thank the reviewer for bringing up these issues. Although our current data cannot explicitly clarify these possibilities, we favor an idea that DNA methylation specifically alters histone to DNA contacts and that this effect is felt globally across the entire nucleosome rather than only at specific locations. The intrinsic flexibility of linker DNA ends means that that region tends to exhibit the greatest differences under different physical influences, hence the focus on characterizing that area; flexibility of a thread on a spool is most pronounced at the ends. However, we also found that the DNA backbone of H2A.Z on methylated DNA had a lower local resolution compared to its unmethylated counterpart, despite that structure having a higher global resolution, which suggested to us that DNA positioning along the nucleosome is overall weaker under the presence of DNA methylation. This is corroborated by the increased population of open/shifted structures in our classification analysis. The reviewer raises a fair point about the use of a specific restriction enzyme versus MNase. We agree that our accessibility assay is highly influenced by the position of the restriction site and have previously seen that moving the cut site too close to the linker DNA end will abolish any DNA methylation-dependent differences. We did initially attempt an MNase digestion-based assay, but the data were not as reproducible as with the use of a specific restriction enzyme. We do not know the reason behind this irreproducibility though we believe that the processivity of MNase could make it difficult to capture subtle effects like those induced by DNA methylation on already highly accessible H2A.Z nucleosomes. Overall, while we believe that DNA methylation does exert a physical effect, its subtlety may explain the many contradictory studies present within the DNA methylation and nucleosome stability field.

      References

      Berta, D.G., H. Kuisma, N. Valimaki, M. Raisanen, M. Jantti, A. Pasanen, A. Karhu, J. Kaukomaa, A. Taira, T. Cajuso, S. Nieminen, R.M. Penttinen, S. Ahonen, R. Lehtonen, M. Mehine, P. Vahteristo, J. Jalkanen, B. Sahu, J. Ravantti, N. Makinen, K. Rajamaki, K. Palin, J. Taipale, O. Heikinheimo, R. Butzow, E. Kaasinen, and L.A. Aaltonen. 2021. Deficient H2A.Z deposition is associated with genesis of uterine leiomyoma. Nature. 596:398–403.

      Capurso, D., H. Xiong, and M.R. Segal. 2012. A histone arginine methylation localizes to nucleosomes in satellite II and III DNA sequences in the human genome. BMC Genomics. 13:630.

      Chodavarapu, R.K., S. Feng, Y.V. Bernatavichute, P.Y. Chen, H. Stroud, Y. Yu, J.A. Hetzel, F. Kuo, J. Kim, S.J. Cokus, D. Casero, M. Bernal, P. Huijser, A.T. Clark, U. Kramer, S.S. Merchant, X. Zhang, S.E. Jacobsen, and M. Pellegrini. 2010. Relationship between nucleosome positioning and DNA methylation. Nature. 466:388–392.

      Choy, J.S., S. Wei, J.Y. Lee, S. Tan, S. Chu, and T.H. Lee. 2010. DNA methylation increases nucleosome compaction and rigidity. J Am Chem Soc. 132:1782–1783.

      Chua, E.Y., D. Vasudevan, G.E. Davey, B. Wu, and C.A. Davey. 2012. The mechanics behind DNA sequence-dependent properties of the nucleosome. Nucleic Acids Res. 40:6338–6352.

      Collings, C.K., P.J. Waddell, and J.N. Anderson. 2013. Effects of DNA methylation on nucleosome stability. Nucleic Acids Res. 41:2918–2931.

      Davey, C., S. Pennings, and J. Allan. 1997. CpG methylation remodels chromatin structure in vitro. J Mol Biol. 267:276–288.

      Davey, C.S., S. Pennings, C. Reilly, R.R. Meehan, and J. Allan. 2004. A determining influence for CpG dinucleotides on nucleosome positioning in vitro. Nucleic Acids Res. 32:4322–4331.

      Funabiki, H., I.E. Wassing, Q. Jia, J.D. Luo, and T. Carroll. 2023. Coevolution of the CDCA7-HELLS ICF-related nucleosome remodeling complex and DNA methyltransferases. Elife. 12.

      Ghenoiu, C., M.S. Wheelock, and H. Funabiki. 2013. Autoinhibition and polo-dependent multisite phosphorylation restrict activity of the histone h3 kinase haspin to mitosis. Mol Cell. 52:734–745.

      Jenness, C., S. Giunta, M.M. Muller, H. Kimura, T.W. Muir, and H. Funabiki. 2018. HELLS and CDCA7 comprise a bipartite nucleosome remodeling complex defective in ICF syndrome. Proc Natl Acad Sci U S A. 115:E876–E885.

      Jimenez-Useche, I., J. Ke, Y. Tian, D. Shim, S.C. Howell, X. Qiu, and C. Yuan. 2013. DNA methylation regulated nucleosome dynamics. Sci Rep. 3:2121.

      Jimenez-Useche, I., D. Shim, J. Yu, and C. Yuan. 2014. Unmethylated and methylated CpG dinucleotides distinctively regulate the physical properties of DNA. Biopolymers. 101:517–524.

      Kelly, A.E., C. Ghenoiu, J.Z. Xue, C. Zierhut, H. Kimura, and H. Funabiki. 2010. Survivin reads phosphorylated histone H3 threonine 3 to activate the mitotic kinase Aurora B. Science. 330:235–239.

      Li, S., Y. Peng, D. Landsman, and A.R. Panchenko. 2022. DNA methylation cues in nucleosome geometry, stability and unwrapping. Nucleic Acids Res. 50:1864–1874.

      Minary, P., and M. Levitt. 2014. Training-free atomistic prediction of nucleosome occupancy. Proc Natl Acad Sci U S A. 111:6293–6298.

      Ngo, T.T., J. Yoo, Q. Dai, Q. Zhang, C. He, A. Aksimentiev, and T. Ha. 2016. Effects of cytosine modifications on DNA flexibility and nucleosome mechanical stability. Nat Commun. 7:10813.

      Nishiyama, A., L. Yamaguchi, J. Sharif, Y. Johmura, T. Kawamura, K. Nakanishi, S. Shimamura, K. Arita, T. Kodama, F. Ishikawa, H. Koseki, and M. Nakanishi. 2013. Uhrf1-dependent H3K23 ubiquitylation couples maintenance DNA methylation and replication. Nature. 502:249–253.

      Osakabe, A., F. Adachi, Y. Arimura, K. Maehara, Y. Ohkawa, and H. Kurumizaka. 2015. Influence of DNA methylation on positioning and DNA flexibility of nucleosomes with pericentric satellite DNA. Open Biol. 5.

      Perez, A., C.L. Castellazzi, F. Battistini, K. Collinet, O. Flores, O. Deniz, M.L. Ruiz, D. Torrents, R. Eritja, M. Soler-Lopez, and M. Orozco. 2012. Impact of methylation on the physical properties of DNA. Biophys J. 102:2140–2148.

      Segal, E., Y. Fondufe-Mittendorf, L. Chen, A. Thastrom, Y. Field, I.K. Moore, J.P. Wang, and J. Widom. 2006. A genomic code for nucleosome positioning. Nature. 442:772–778.

      Wassing, I.E., A. Nishiyama, R. Shikimachi, Q. Jia, A. Kikuchi, M. Hiruta, K. Sugimura, X. Hong, Y. Chiba, J. Peng, C. Jenness, M. Nakanishi, L. Zhao, K. Arita, and H. Funabiki. 2024. CDCA7 is an evolutionarily conserved hemimethylated DNA sensor in eukaryotes. Sci Adv. 10:eadp5753.

      Zilberman, D., D. Coleman-Derr, T. Ballinger, and S. Henikoff. 2008. Histone H2A.Z and DNA methylation are mutually antagonistic chromatin marks. Nature. 456:125–129.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors designed two sets of experiments to explore the molecular mechanisms underlying the mutually exclusive distribution of H2A.Z and DNA methylation previously reported by several groups.

      First, they examined how DNA methylation affects the physical stability of H2A.Z-containing nucleosomes. Although their results point to subtle differences between nucleosomes assembled on methylated versus unmethylated DNA, the authors did not extend their analyses to directly test the stability of these H2A.Z-containing nucleosomes under more challenging conditions. Prior studies have demonstrated that certain nucleosomes, such as those containing H3.3-H2A.Z or H2A.Z-H3K56Q, exhibit specific instability, but such instability is only revealed under challenging conditions, for example, altered salt concentrations or the presence of additional factors like FACT (PMID: 17575053; PMID: 19633671; PMID: 19639024; PMID: 41303375). In light of this literature, the observable structural features noted here for nucleosomes containing H2A.Z and methylated DNA are suggestive of increased instability, yet the authors did not employ comparable approaches to rigorously test whether such instability might explain the absence of H2A.Z from methylated genomic regions.

      As a result, at this stage of analysis, the idea that nucleosomes containing both H2A.Z and methylated DNA are intrinsically unstable, and that this instability accounts for the depletion of H2A.Z from methylated regions, remains unsubstantiated.

      We thank the reviewer for the constructive criticisms. Through our response to these points, we were able to significantly improve our manuscript, including major rewriting of the Abstract and Discussion as well as incorporation of new data.

      We agree that combinations with other histone variants, modifications, and mutations could further affect our observed impact of DNA methylation on H2A.Z-nucleosome stability. What we observed based on satellite II-derived DNA was that DNA methylation made H2A.Z nucleosomes (with H3.2) more open, although the effect of DNA methylation is relatively small (as compared to the general impact of H2A.Z incorporation). We readily admit that such a subtle physical effect is unlikely to be the main driver of the antagonistic distribution of H2A.Z and DNA methylation, though small physical changes have been known to influence larger biological functions, and we therefore sought to describe additional regulatory factors that could play major roles.

      We also agree that H3.3 is of major interest when discussing H2A.Z. In our Xenopus egg extract experiments using DNA beads, the primary H3 variant deposited is H3.3 as no DNA replication occurs on the beads to allow for H3.1/.2 replication-coupled deposition. From those experiments, we demonstrated that preferential loading of H2A.Z can be primarily explained by SRCAP. In other words, in the absence of SRCAP, loading/retention of H2A.Z on H3.3 nucleosomes was not noticeably affected by DNA methylation, indicating that DNA methylation’s physical effects on H2A.Z nucleosomes play little, if any, role in the preferential accumulation of H2A.Z on unmethylated DNA, at least in the context of synthetic DNA beads incubated in Xenopus egg extract lacking active transcription. Our sequencing data hint at the interesting possibility that transcription, along with other factors missing in egg extract, may be involved in further pruning H2A.Z from methylated DNA, which conceivably could take advantage of subtle physical alterations. However, we agree that we lack firm supporting evidence for such a mechanism, which led us to forgo including it in our final model figure; we instead only report our observations, with discussion of potential biological implications and limitations. Of note, it has been reported that the H2A.Z nucleosome is more accessible than the H2A nucleosome, while inclusion of H3.3 does not further enhance accessibility of the H2A.Z nucleosome (PMID 38920622). We have now noted these points in the Discussion of our revised manuscript.

      We appreciate and agree with this reviewer’s point that nucleosome instability sometimes requires challenging conditions to be fully revealed. However, in our system, use of H2A.Z was the challenge provided as we find in our hands that H2A.Z by itself substantially destabilizes histone-DNA contacts compared to canonical H2A. And it is only with this already destabilized nucleosome that we see further enhancement of accessibility/openness in the presence of DNA methylation. This is similar to findings by [PMID: 23260052] that reported that only an intrinsically destabilized sub-population of canonical H2A nucleosomes on 601 DNA experienced detectable physical changes in the presence of DNA methylation.

      In response to this reviewer's comment, we edited the Abstract and Discussion to clearly note the subtlety of the impact of DNA methylation on H2A.Z nucleosome structure, and that the potential functional significance remains an open question.

      Second, the authors investigated whether SRCAP-C contributes to preferential H2A.Z incorporation into unmethylated DNA. The absence of H2A.Z from methylated regions does not necessarily imply that it cannot be incorporated there; it may instead reflect the chromatin environment associated with DNA methylation, which could disfavor SRCAP-C activity, whereas open chromatin environments strongly promote SRCAP-dependent H2A.Z deposition.

      This reviewer suggested an alternative model where SRCAP prefers to act on open chromatin and that the apparent preferential H2A.Z deposition to unmethylated DNA is due solely to the increased accessibility associated with unmethylated DNA. Following such a model, one would predict that SRCAP-C's preference for unmethylated DNA would be eliminated on nucleosome-free DNA in Xenopus egg extracts. To test this alternative model, we repeated the SRCAP-C binding experiment in egg extracts depleted of the HIRA complex, the H3.3-H4 chaperone responsible for de novo nucleosome assembly on exogenously added DNA in egg extracts. Contrary to this prediction, both SRCAP and ZNHIT1 still display preferential binding to unmethylated DNA substrates in HIRA-depleted extracts in which nucleosome assembly is suppressed (newly added Suppl Fig 16). The results argue that discrimination of SRCAP-C from methylated DNA is not due to a potential effect of chromatin compaction by DNA methylation. Furthermore, our new result is in line with the idea that SRCAP employs 1D diffusion on the linker DNA before engaging the H2A nucleosome (PMID 39131301), implying that discrimination of SRCAP-C from methylated linker DNA contributes to this process. This is now illustrated in the new model Figure 6.

      Please note we also indicate in both our model and in text that there exists an additional methylation-insensitive mechanism that drives H2A.Z deposition on methylated DNA, leading to a substantial amount of colocalized H2A.Z and DNA methylation. Why two different deposition pathways for H2A.Z differing in their methylation sensitivities must exist is an interesting topic for future work and has not been described prior to our report.

      This interpretation is consistent with the authors' own comparative mapping of H2A.Z and DNA methylation in sperm pronuclei incubated in egg extract versus a transcriptionally active Xenopus fibroblast line. They observed that about 40% of H2A.Z-associated genomic DNA is methylated in sperm pronuclei, but only 3% in fibroblasts. As they note, the major difference between these systems is the presence of transcription in fibroblasts, a process known to drive H2A.Z eviction/recycling, and which is absent in the egg-extract system. Thus, no specific inhibition of SRCAP-C by methylated DNA needs to be invoked: H2A.Z deposition on both methylated and unmethylated accessible regions, followed by preferential eviction from methylated sites in active nuclei, could fully account for the observed patterns.

      As the reviewer correctly notes here, we proposed that transcription is likely to play an important role in pruning H2A.Z from methylated DNA. Our observations and proposed mechanism do not argue against the possible existence of a DNA methylation-insensitive, transcription-dependent mechanism that promotes dissociation of H2A.Z from methylated DNA, which we believe likely would be correlated to gene body methylation. In fact, we did propose in our Discussion that such a transcription-mediated mechanism may conceivably take advantage of the subtly destabilized DNA wrapping of H2A.Z nucleosomes on methylated DNA to further selectively prune H2A.Z at colocalized regions. However, such a mechanism would be an additional component to what we have already described and does not explain the observed preferential recruitment of SRCAP-C to unmethylated DNA in Xenopus egg extracts in the absence of active transcription.

      In this respect, studies from the Felsenfeld laboratory showing that double-variant nucleosomes are highly unstable under physiological ionic conditions are particularly relevant (PMID: 19633671; PMID: 19639024). They demonstrated that such unstable nucleosomes are only evident under low ionic strength extraction conditions, emphasizing that the apparent absence of H2A.Z may reflect facilitated removal rather than failure of assembly.

      The authors may also have been influenced by the study of Berta et al. (cited in the manuscript), which examined uterine leiomyomas harboring somatic or germline mutations in SRCAP-C subunits. In those tumors, the normal association of H2A.Z with accessible, active chromatin, and its exclusion from methylated regions, was lost. However, this observation does not demonstrate that SRCAP-C actively prevents H2A.Z incorporation into methylated DNA. Instead, it may simply reflect that in the absence of SRCAP-C, a default, less efficient deposition pathway operates regardless of whether the chromatin environment is normally permissive or restrictive for SRCAP-dependent activity.

      Even if one accepts the more straightforward interpretation proposed by the present authors, that SRCAP-C is actively inhibited by methylated DNA, as suggested by their pull-down experiments from Xenopus egg extracts using unmethylated and methylated DNA, the hypothesis lacks mechanistic support.

      Considering this reviewer's criticism, we have expanded our discussion to indicate a possibility that SRCAP-C may have an alternative mechanism to find open chromatin independent of DNA methylation status. However, our data show that SRCAP-C preferentially binds to unmethylated DNA in a manner independent of transcription or other epigenetic status in Xenopus egg extracts, and that SRCAP-C carries the major mechanism that explains preferential deposition of H2A.Z to unmethylated DNA. Therefore, we believe that our study for the first time offers a mechanistic explanation of how H2A.Z discrimination from methylated DNA is accomplished through SRCAP-dependent H2A.Z deposition.

      The following points summarize the issues discussed above:

      (1) The authors did not sufficiently test the hypothesis that H2A.Z-methylated DNA nucleosomes are inherently unstable and could explain the exclusion of H2A.Z from methylated genomic regions.

      We stand by our conclusion that DNA methylation has an intrinsic capacity to make the H2A.Z nucleosome more open and accessible, even though the effect is subtle. We did not argue that this subtle effect can fully explain the exclusion of H2A.Z from methylated genomic regions. Rather, our Xenopus egg extract experiment suggested that in the transcriptionally inactive egg extract setting, such a mechanism plays little or no role and it is SRCAP-C instead that is the major driver. Whether this physical mechanism also contributes to their exclusion in cells with active transcription remains a future subject of study.

      (2) The proposed active role of SRCAP-C in preventing H2A.Z assembly on methylated DNA is supported only by limited experimental data and lacks a mechanistic explanation. In particular, this hypothesis does not account for the significant H2A.Z assembly observed on methylated DNA regions in sperm nuclei after incubation in egg extract.

      We respectfully disagree with this summary assessment. Our conclusions are well aligned with the substantial H2A.Z association with methylated DNA seen in sperm pronuclei assembled in Xenopus egg extracts. We demonstrated that:

      (1) In transcriptionally silent Xenopus egg extracts using synthetic DNA beads, DNA binding of SRCAP-C is inhibited by DNA methylation.

      (2) In this setup, H2A.Z is preferentially, if not exclusively, loaded onto unmethylated DNA over methylated DNA.

      (3) Depletion of SRCAP-C almost completely eliminated preferential association of H2A.Z to unmethylated DNA, while leaving some DNA methylation-insensitive H2A.Z loading.

      (4) These data indicate the presence of a SRCAP-C-dependent, DNA methylation-sensitive mechanism as well as a SRCAP-C-independent, DNA methylation-insensitive mechanism to load H2A.Z to chromatin. This conclusion matches well with our genomic analysis showing that H2A.Z is preferentially but not exclusively loaded to hypomethylated genomic segments in sperm pronuclei in Xenopus egg extracts.

      (5) As we clearly discussed, this SRCAP-C-dependent mechanism by itself is insufficient to explain the much clearer exclusion of H2A.Z in somatic cells. We discussed the possibility that transcription contributes to further pruning of H2A.Z from methylated DNA.

      To deliver this overall message with nuances that we noted above, we have heavily revised the Abstract, the model Figure 6, and Discussion. Thanks to the criticisms raised by this reviewer, we believe that our revised manuscript has been significantly improved.

      Reviewer #2 (Recommendations for the authors):

      (1) A major omission is the absence of a cryo-EM structure of a canonical nucleosome assembled on the same DNA template - this is essential to assess whether the observed effects are H2A.Z-specific.

      We had considered solving the H2A structures but ultimately decided against it for a few reasons. First, there already exist crystal structures of canonical H2A nucleosomes using a DNA sequence highly similar to our Sat2R-P with and without the presence of DNA methylation (PDB: 5CPI and 5CPJ). The authors of that study did not see any physical differences present in their structures (Osakabe et al., 2015). Additionally, we had included canonical H2A conditions within our restriction enzyme accessibility assay and did not see a significant impact of DNA methylation on those samples (Fig 3). Because of the previous report and our own negative data, we expected that only limited additional insights would be obtained from the canonical H2A structures and decided not to pursue that analysis, considering the cost and effort for this additional cryo-EM analysis.

      (2) The reported increase in accessibility of the methylated H2A.Z nucleosome is negligible compared with the much larger intrinsic DNA accessibility of the unmethylated H2A.Z nucleosome. Claims that methylated H2A.Z nucleosomes are "more open and accessible" must therefore be removed, and the title is misleading, given that no meaningful impact of DNA methylation on H2A.Z nucleosome stability is demonstrated.

      We respectfully disagree with this reviewer's criticism. We investigated the potential impact of DNA methylation on nucleosome stability to the best of our abilities through complementary assays and reported our observations. The effect of DNA methylation is smaller than the difference between H2A.Z and H2A, but we were able to see an effect. It is also not uncommon for small differences to have functional impacts in biological systems. We agree that further testing is required to determine whether this subtle effect is functionally important, and it remains the subject of future research due to the many technical challenges associated with addressing said question. We would like to note that 18 years have passed since Daniel Zilberman first reported the antagonistic relationship between H2A.Z and DNA methylation (Zilberman et al., 2008) but very few studies have since directly tested specific mechanistic hypotheses. We believe that our study lays the groundwork for exciting future investigation that better elucidates the pathways that contribute to this antagonism and will have meaningful impacts on the field in general. However, thanks to the reviewer's criticism, we realized that we did not clearly state in the Abstract that the effect of DNA methylation on intrinsic H2A.Z nucleosome stability is relatively subtle. We will accordingly revise the Abstract, the model Figure 6, and Discussion to make this point clearer.

      (3) The cryo-EM structures of methylated and unmethylated 601L H2A.Z nucleosomes show no detectable differences. As presented, this negative result adds little value and should be removed.

      We believe the inclusion and factual reporting of negative data is important for the scientific community as one of the major issues currently in biology research is biased omission of negative data. We considered eLife as a venue to publish this work for this reason. We understand that the reviewer believes our 601L structures may detract from the overall message of our manuscript, however, we believe that this data rather emphasizes the importance of DNA sequence context, something that the reviewer also rightfully notes. It is standard practice in the nucleosome field to use the Widom 601 sequence, along with its variants. Our experience has shown that use of an artificially strong positioning sequence may mask weaker physical effects that could play a physiological role. Thus, we were careful to validate all further assays with multiple DNA sequences and believed it important to report these sequence-dependent effects on nucleosome structure.

      (4) Very little H3 signal coincides with H2A.Z at TSSs in sperm pronuclei, yet this is neither explained nor discussed (Supplementary Figure 10D). The authors need to clarify this.

      Our H3 signal, which represents the global nucleosome population, is more broadly distributed across the genome than H2A.Z, which is known to localize at specific genomic sites. Since both histone types were sequenced to similar read depths, H3 peaks are generally shallower than H2A.Z and peak heights cannot be directly compared (i.e. they should be represented in separate appropriate data ranges).

      (5) In my view, the most conceptually important finding is that H2A.Z-associated reads in sperm pronuclei show ~43% CpG methylation. This directly contradicts the model of strict mutual exclusivity and suggests that the antagonism is context-dependent. Similarly, the finding that the depletion of SRCAP reduces H2A.Z deposition only on unmethylated templates is also very intriguing. Collectively, these results warrant further investigation (see below).

      (6) Given that H2A.Z is located at diverse genomic elements (e.g., enhancers, repressed gene bodies, promoters), the manuscript requires a more rigorous genomic annotation comparing H2A.Z occupancy in sperm pronuclei versus XTC-2 cells. The authors should stratify H2A.Z-DNA methylation relationships across promoters, 5′UTRs, exons, gene bodies, enhancers, etc., as described in Supplementary Figure 10A.

      We appreciate recognition of the importance of our finding by this reviewer. We agree that the substantial presence of co-localized H2A.Z and DNA methylation specifically in the sperm pronuclei samples and the changes in pattern between nuclear types are highly interesting and require further investigation. However, we faced technical challenges in our sequencing experiments that made us refrain from conducting a more detailed analysis for fear of over-interpreting potential artifacts. These challenges mainly stemmed from the difficulties in collecting enough material from Xenopus egg extracts and Tn5’s innate bias towards accessible regions of the genome. Because of this, open regions of the genome tend to be overrepresented in our data (as noted in our Discussion), making it challenging to rigorously compare methylation profiles and H2A.Z/H3 associated genomic elements.

      While the degree of separation seems to be dependent on nuclei type, we still believe the antagonism exists in both the sperm pronuclei and XTC-2 samples when comparing H2A.Z methylation profiles to the corresponding H3 condition. Our study also demonstrates that H2A.Z is preferentially deposited to hypomethylated DNA in a manner dependent on SRCAP-C (the loss of SRCAP only reduces H2A.Z on unmethylated substrates) but an additional methylation-insensitive H2A.Z deposition mechanism also exists. We realized that this interesting point was not clearly highlighted in the Abstract, so we will revise it accordingly.

      (7) Although H2A.Z accumulates less efficiently on exogenous methylated substrates in egg extract, substantial deposition still occurs (~50%). This observation directly challenges the strong antagonistic model described in the manuscript. The authors need to discuss this in more detail.

      As depicted in Figure 6 and described in the Discussion, we indicated that both methylation-sensitive and methylation-insensitive pathways exist to deposit H2A.Z within the genome. We also directly stated in our Discussion that a substantial proportion of H2A.Z colocalizes with DNA methylation both in our study as well as in previous reports, which is of major interest for future study. Additionally, we further discussed how the absence of transcription in Xenopus eggs is a likely reason for the more limited effect of DNA methylation restricting H2A.Z deposition in our egg extract system. In the revised manuscript, we heavily edited the Discussion to better clarify these points.

      (8) The SRCAP depletion is insufficiently validated, i.e., the antibody-mediated depletion of SRCAP lacks quantitative verification. A minimum of three biological replicates with quantification is required to substantiate the claims.

      In response to this, quantification of the SRCAP depletion is now included as Supplementary Figure 13A and B. Since our anti-ZNHIT1 antibodies reproducibly detected ZNHIT1 on DNA beads isolated from egg extracts, we have conducted additional verification of the SRCAP depletion by probing for SRCAP and ZNHIT1 on DNA beads, confirming that these proteins were depleted on DNA beads upon immunodepletion with anti-SRCAP antibodies (Author response image 1). To further validate this conclusion, we added data showing that the effect of SRCAP depletion on methylation-sensitive H2A.Z deposition was reproduced through use of a different commercially available antibody raised against human SRCAP (newly added Suppl Fig 14).

      Author response image 1.

      Verification of SRCAP depletion using DNA beads. DNA beads were incubated in interphase-cycled Xenopus egg extract that had been depleted with either our custom SRCAP antibody or an IgG negative control. SRCAP and ZNHIT1 association was then assessed via Western Blot.

      (9) It appears that the role of p400-Tip60 has been completely overlooked. This complex is the second major H2A.Z deposition complex. Because p400 exhibits DNA methylation-insensitive binding (Supplementary Figure 14), it may account for the deposition of H2A.Z onto methylated DNA. This possibility is highly significant and must be addressed by repeating the key experiments in Figure 5 following p400-Tip60 depletion.

      Thank you very much for raising this interesting point. We were aware that the TIP60 complex is a very likely candidate for mediating DNA methylation-insensitive H2A.Z deposition, which is why we tested whether DNA binding of p400 is methylation sensitive (shown in the revised Supplementary Figure 15). We wished to test the potential contribution of TIP60-C, but, unfortunately, the antibodies we currently have available to us were not successful in depleting the complex from egg extract. Since we had no direct experimental evidence indicating the role TIP60-C plays, we decided to take a conservative approach to our model and leave the methylation-insensitive pathway as mediated by something still unidentified. While further investigating TIP60-C’s contribution to this pathway is of definite value, we do not believe that it impacts our major conclusion that SRCAP-C is the main mediator responsible for H2A.Z deposition on unmethylated DNA, and it thus remains a subject for future study. However, we have now added descriptions to note that TIP60-C is a likely candidate to execute the SRCAP-independent and methylation-insensitive mechanism of H2A.Z loading in Xenopus egg extracts. In the model figure, we initially did not include TIP60-C, but we now indicate TIP60-C as a likely candidate in the revised model (Figure 6) to facilitate future research in the field.

      (10) The manuscript repeatedly states that H2A.Z nucleosomes are intrinsically unstable; however, this is an oversimplification. Although some DNA unwrapping is observed, multiple studies show that H3/H4 tetramer-H2A.Z/H2B interactions are more stable (important recent studies include the following: DOI: 10.1038/s41594-021-00589-3; 10.1038/s41467-021-22688-x; and reviewed in 10.1038/s41576-024-00759-1). These references should be considered.

      We appreciate that the reviewer points out this important issue. Although we had described that controversy exists regarding how H2A.Z and DNA methylation contribute to nucleosome stability, it was not clearly explained. We understand that this confusion was in part due to the term “nucleosome stability”, which is broad and encompasses many physical aspects. As noted in a prior response, we now better specify our use of the term within the manuscript, emphasizing nucleosome openness and accessibility, particularly at the nucleosome core particle entry/exit sites. As noted by published studies (PMID 38920622), the impact on nucleosome stability may differ between the internal and external segments of nucleosomal DNA. In our assays, we are most focused on the DNA wrapping stability of the nucleosome and have consistently seen in our hands that H2A.Z nucleosomes are much more open and accessible at DNA ends compared to canonical H2A on satellite II-derived sequences, regardless of methylation status. However, we do understand that many groups have observed the opposite findings while others have obtained results similar to ours. This may be caused by the use of different assays (for example, nucleosome assembly during salt dialysis or salt sensitivity vs openness/accessibility of preassembled nucleosomes). In the Discussion of the revised manuscript, we now explain these factors, with the hope that our study will help clarify some of the field’s controversies.

      Reviewer #3 (Recommendations for the authors):

      (1) Since the cryo-EM structure determined by single-particle analysis represents only one major population, it would be important to determine the dyad axis position by complementary biochemical assays, such as MNase-seq or chemical digestion by the Fenton reaction (PMID: 22929776).

      We would like to thank the reviewer for bringing up this important issue. We agree that the high-resolution structure represents only a subpopulation in which we specifically selected for the most stably wrapped nucleosomes in each sample. This issue is why we then supplemented our high-resolution structure with our in-silico classification analysis to survey the overall structure distribution of the full nucleosome particle population. The classification input contains all nucleosome-like particles picked from both unmethylated and methylated sample micrographs mixed together, ensuring that all particles are taken into consideration and that both samples have been analyzed in an identical manner. From our sorting analysis, we find an increased population of open and shifted nucleosome structures present in our methylated DNA sample, indicating destabilization of DNA-histone wrapping with DNA methylation. This is corroborated by the lower local resolution seen on the DNA backbone of our high-resolution H2A.Z on methylated DNA structure, despite it having a higher global resolution compared to its unmethylated counterpart. This suggested to us that DNA positioning along the nucleosome is overall weaker in the presence of DNA methylation.

      The reviewer raises a fair point about the use of a specific restriction enzyme versus MNase. We agree that our accessibility assay is highly influenced by the position of the restriction site and have previously seen that moving the cut site too close to the linker DNA end will abolish any DNA methylation-dependent differences. We realized that we did not explain how we decided to place the HinfI site in the context of our solved cryo-EM structure. In the revised Figure 3B, we now illustrate that the HinfI site is located at a segment where H2A/H2A.Z directly contacts the DNA and explained that this segment belongs to the region that exhibited clear methylation-induced flexibility in our cryo-EM structures. Thus, our structure helped us design this experiment.

      We did initially attempt an MNase digestion-based assay, but the data were not as reproducible as with the use of a specific restriction enzyme. We do not know the reason behind this irreproducibility though we believe that the processivity of MNase could make it difficult to capture subtle effects like those induced by DNA methylation on already highly accessible H2A.Z nucleosomes, as subtle technical errors in the MNase concentration can have significant effects. Overall, while we believe that DNA methylation does exert a physical effect, its subtlety may explain the many contradictory studies present within the DNA methylation and nucleosome stability field.

      (2) I assume that the authors confirmed complete DNA methylation by restriction enzyme digestion. It would be helpful to include this validation in supplementary figures.

      We would like to thank the reviewer for pointing out that this critical verification was missing from our initial manuscript. DNA methylation of Sat2R-P and Sat2R was verified via BstBI digestion (Suppl Fig 1B and 7D, respectively); 601L was verified via HpaII digestion (Suppl Fig 6B); and 19x601 DNA was verified via BstUI digestion (Suppl Fig 11A). All data have been added to the specified figures. Unfortunately, the 16xHSat2 DNA substrate we used in our assays does not contain appropriate cut-sites for methylation-sensitive restriction enzymes. Because of that, we always prepared the 16xHSat2 DNA in parallel with the 19x601 substrate under identical conditions and then used digestion of the 19x601 substrate to verify the quality of methylation for each batch. To more directly verify methylation of 16xHSat2 DNA, we used Xenopus laevis ZHX2 and ZHX3, which we recently identified as proteins that selectively associate with methylated DNA in Xenopus egg extracts. Although identification and characterization of Xenopus ZHX2/3 will be described elsewhere, previously published proteomic studies have also identified mammalian ZHXs as proteins that enrich on methylated DNA (PMID 21029866, 23434322). By incubating DNA beads in Xenopus egg extract and probing for endogenous ZHX2/3 (our antibody recognizes both ZHX2 and ZHX3), we verified that ZHXs selectively bind to methylated 16xHSat2 but not unmethylated DNA (Author response image 2). Although this does not necessarily verify that all CpGs in 16xHSat2 were methylated, we observed comparable methylation-induced inhibition of SRCAP binding between 16x601 and 16xHSat2, supporting our conclusion.

      Author response image 2.

      Verification of 16xHSat2 methylation status via ZHX2/3 protein binding. 16xHSat2 DNA beads were incubated in Xenopus egg extract and endogenous ZHX2/3 protein binding assessed via Western Blot with a custom generated antibody that recognizes both ZHX2 and ZHX3.

      (3) Figure 1A: The dyad position is difficult to identify. Please indicate it clearly using a distinct color (not green).

      We now directly indicate each sequence midpoint with a black triangle and also changed the font of DNA sequences to further clarify that the dyad resides at the palindromic center.

    1. eLife Assessment

      This study reports on the development and characterization of chickens with genetic deficiencies in type I or type III interferon receptors, which is an important contribution to the field of avian immunology. The data reflecting the development of the new interferon-receptor-deficient chickens is compelling. The initial characterization of IFN biology and infection responses in these knockout chickens provides a solid foundation for future studies on the distinct contributions of type I and type III interferon signaling to antiviral responses.

    2. Reviewer #2 (Public review):

      Summary:

      This is a laudable effort to help dissect the contributions of type I and type III IFNs to the antiviral response in chicken and therefore represents an important piece of work, not least in the light of birds being a key carrier and worldwide distributor of influenza virus. The first part of the study characterises the generation of IFNAR and IFNLR KO chicken strains and describes basic differences. Four different viruses are then tested in chicken embryos, while the subsequent analysis of the antiviral response in vivo is performed with one influenza H3N1 strain.

      Strengths:

      Having these two KO chicken strains as a tool is a great achievement. The initial analysis is solid. Clear effect of IFNAR deficiency in in vivo infection, less so for IFNLR deficiency.

      Weaknesses:

      (1) The antibody induction by KLH immunisation: We still don't know whether or not this vaccination induces IFN responses in wt mice, so it is still not possible to judge whether the effects observed are due to steady-state differences or to differential effects of IFN induced during the vaccination phase. Pre-immune results are now shown and are indeed zero. As suggested, the whole of Figure 4 is now condensed into one or two panels by proper calculation of Ab titers - would these titres be significantly different? This, like all of the other in vivo experiments, has not been repeated, if I understand the methods section correctly. I understand that there are three R restrictions that are tighter in some countries, and I accept that with the numbers used here, some statistical significance is reached, but this is, for instance, not the case for survival.

      (2) The basic conundrum here and in later figures is now addressed by the authors in the discussion: Situations where IFN type 1 and 3 signalling deficiency each have an independent effect (i.e., Fig. 4d) suggest that they act by separate, unrelated mechanisms. However, all the literature about these IFN families suggests that they show almost identical signalling and gene induction downstream of their respective receptors. How can the same signalling, clearly active here downstream of the receptors for IFN type 1 or type 3, be non-redundant, i.e., why does the unaffected IFN family not stand in? The mouse studies, which showed a rather subtle phenotype when only one of the two IFN systems was missing, but a massive reduction in virus control in double KO mice, are discussed, but a clear-cut explanation for the differences has not been reached. Reasons could be a direct effect of IFNab on B cells and an indirect effect of IFNL through non-B cells, timing issues, and many other scenarios can be envisaged. The authors do not address this question experimentally, which limits the depth of analysis; they have, however, now included a discussion of this dilemma.

      (3) In the one in vivo experiment performed with chickens, only one virus was tested; more influenza strains should be included, as well as non-influenza viruses. I appreciate that this is logistically difficult.

      (4) The basic conundrum of point 2 applies equally to Fig. 6a, both KOs have a phenotype. Again, in 6d, both IFNs appear to be separately required for Mx induction. An explanation has been attempted, but more experiments, for instance looking at different time points to understand if we are dealing simply with different kinetics of the response, have not been attempted, despite the fact that such experiments are likely not covered by strict three R rules.

      (5) The in vivo infection is the most interesting experiment, and the key outcome here is that IFN type 1 is crucial for anti-H3N1 protection in chickens, while type 3 is less impactful. However, this experiment suffers from the different time points when chickens were culled, so many parameters are impossible to compare (e.g. weight loss, histopathology). Some explanation is given as to the comparisons chosen here, but a more thorough analysis at several time points would have strengthened this study.

      Comments on revised version:

      In the rebuttal, the authors have gone to some length to add to the discussion of the experiments, and some aspects are better explained now than before. Many of these explanations remain speculative however, so the study remains inconclusive in several aspects. As no new data was added, my overall judgement of this study remains unchanged.

    3. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript presents an extensive body of work and an outstanding contribution to our understanding of the IFN type I and III system in chickens. The research started with the innovative approach of generating KO chickens that lack the receptor for IFNα/β (IFNAR1) or IFN-λ (IFNLR1). The successful deletion and functional loss of these receptors was clearly and comprehensively demonstrated in comparison to the WT. Moreover, the homozygous KO lines (IFNAR1-/- or IFNLR1-/-) were found to have similar body weights, and normal egg production and fertility compared to their WT counterparts. These lines are a major contribution to the toolbox for the study of avian/chicken immunology.

      The significance of this contribution is further demonstrated by the authors' use of these lines to gain insight into the roles of type I and type III IFN in chickens, through studies examining basic aspects of immune system development and function, as well as responses to viral challenges conducted in ovo and in vivo.

      Solid, state-of-the-art methods and convincing evidence from studies comparing various immune system related functions in the IFNAR1-/- or IFNLR1-/- lines to the WT revealed that the deletion of IFNAR1 and/or IFNLR1 resulted in:

      (1) impaired IFN signaling and induction of anti-viral state;

      (2) modulation of immune cell profiles in the peripheral blood circulation and spleen;

      (3) modulation of the cecum microbiome;

      (4) reduced concentrations of IgM and IgY in the blood plasma before and following immunization with model antigen KLH, whereby also line differences in the time-course of the antibody production were observed;

      (5) decrease in MHCII+ macrophages and B cells in the spleen of IFNAR1 KO chickens, although the MHCII-expression per cell was not affected in this line; and

      (6) reduction in the response of αβ1 TCR+ T cells of IFNAR1 KO chickens as suggested by clonal repertoire analyses.

      These studies were then followed by examination of the role of type I and type III IFN in virus infection, using different avian influenza A virus strains as well as an avian gammacoronavirus (IBV) in in ovo challenge experiments. These studies revealed: viral titers that reflect virus-species and strain-specific IFN responses; no differences in the secretion of IFN-α/β in both KO lines compared to the WT line; a predominant role of type I IFN in inducing the interferon-stimulated gene (ISG) Mx; and that an excessive and unbalanced type I IFN response can harm host fitness (survival rate, length of survival) and contribute to immunopathology.

      Based on guidance from the in ovo studies, comprehensive in vivo studies were conducted on host-pathogen interactions in hens from the three lines (WT, IFNAR1 KO, or IFNLR1 KO). These studies revealed the early appearance of symptoms and poor survival of hens from the IFNAR1 KO line challenged with H3N1 avian influenza A virus; efficient H3N1 virus replication in IFNAR1 KO hens; increased plasma concentrations of IFN-α/β and mRNA expression of IFN-λ in spleens of the IFNAR1 KO hens; a pro-inflammatory role of IFN-λ in the oviduct of hens infected with H3N1 virus; increased proinflammatory cytokine expression in spleens of IFNAR1 KO hens; and impairment of negative feedback mechanisms regulating IFN-α/β secretion in IFNAR1 KO hens, together with a significant decrease in this group's antiviral state. Additionally, it was demonstrated that IFN-α/β can compensate for IFN-λ to induce an adequate antiviral state in the spleen during H3N1 infection, but IFN-λ cannot compensate for IFN-α/β signaling in the spleen.

      Strengths:

      (1) Both the methods and results from the comprehensive, well-designed, and well-executed experiments are considered excellent. The results are well and correctly described in the result narrative and well presented in both the manuscript and supplement Tables and Figures. Excellent discussion/interpretation of results.

      (2) The successful generation of the type I and type III IFN KO lines offers unprecedented insight and opens multiple new venues for exploring the IFN system in chickens. The new knowledge reported here is direct evidence of the high impact of this model system on effectively addressing a critical knowledge gap in avian immunology.

      (3) The thoughtful selection of highly relevant viruses to poultry and human health for the in ovo and in vivo challenge studies to examine and assess host-pathogen interactions in the IFNR KO and WT lines.

      (4) Making use of the unique opportunities in the chicken model to examine and evaluate the host's IFN system responses to various viral challenges in ovo, before conducting challenge studies in hens.

      (5) The new knowledge gained from the IFNAR1 and IFNLR1 KO lines will find much-needed application in developing more effective strategies to prevent health challenges like avian influenza and its devastating effects on poultry, humans, and other mammals.

      (6) The excellent cooperation and contributions of the co-authors and institutions.

      Weaknesses:

      No weaknesses were identified by this reviewer.

      We thank Reviewer #1 for the very positive and thoughtful evaluation of our manuscript. We appreciate the recognition of the effort involved in generating and characterizing the IFNAR1<sup>-/-</sup> and IFNLR1<sup>-/-</sup> chicken lines and for highlighting their significance as valuable tools for advancing avian immunology.

      We are grateful for the reviewer’s clear summary of our findings and for acknowledging the quality of the experimental design, data presentation, and interpretation. The encouraging feedback affirms the broader impact of our study and its contribution to understanding type I and type III interferon biology and antiviral defense mechanisms in chickens.

      We have carefully considered all reviewer comments and revised the manuscript accordingly to further clarify methodological details and improve the presentation of our results.

      Reviewer #1 (Recommendations for the authors):

      Minor suggestions/corrections:

      (1) Line 192, 193, 196 - the superscript "+" sign appears to be underlined.

      We corrected the formatting of all superscript "+" symbols (L 192-196).

      (2) L195: ...in the spleen "of both IFNR KO lines" (or some clarification of what you are comparing).

      The sentence was revised to read “in the spleen of both IFNR knockout lines” for clarity (L 195).

      (3) L198: replace "highlighting" with "and".

      “Highlighting” was replaced with “and” as suggested (L 198).

      (4) L231 and 235: change "monocytes" to "macrophages" as this description appears to refer to spleen cells. Also, make this change in Figure 3b and in the Figure 3 caption (e.g. monocytes/macrophages).

      “Monocytes” was replaced with “macrophages” to accurately describe spleen cells. The same correction was made in Figure 3b and the Figure 3 caption as well as in the supplementary Figure 4 (L 229-234).

      (5) L257: indicate this significant difference in Figure 5b.

      The significant difference has now been clearly indicated in Figure 5b.

      (6) L420, 421: change "monocytes" to "macrophages" as this discussion appears to refer to the spleen.

      “Monocytes” was replaced with “macrophages” to reflect the correct cell type discussed in the spleen context (L 226-227).

      (7) L564-565: has the anti-human MX antibody been shown to cross-react with chicken Mx?

      We thank the reviewer for this valuable comment. Yes, the cross-reactivity of the anti-human MxA monoclonal antibody (clone M143, mouse IgGκ; Merck, Germany) with chicken Mx protein has been previously demonstrated. This antibody has been used successfully to detect chicken Mx in several published studies, including Schusser et al., Journal of Virology (2011). Accordingly, supporting references have been added to the revised manuscript (L584-586).

      (8) L608: how were PBMC and splenocytes (mononuclear spleen cells?) isolated? Line 647 on page 14 mentions their isolation using Histopaque-1077 density gradient centrifugation.

      We thank the reviewer for this helpful comment. A detailed description of the isolation procedure for PBMCs and mononuclear spleen cells has now been added to the Materials and Methods section under the new subsection titled “Isolation of peripheral blood and splenic mononuclear cells”. In this section, we specify that both PBMCs and splenic mononuclear cells were isolated using Histopaque®-1077 density gradient centrifugation (page 14, lines 668-676).

      Reviewer #2 (Public review):

      Summary:

      This study attempts to dissect the contributions of type I and type III IFNs to the antiviral response in chickens. The first part of the study characterises the generation of IFNAR and IFNLR KO chicken strains and describes basic differences. Four different viruses are then tested in chicken embryos, while the subsequent analysis of the antiviral response in vivo is performed with one influenza H3N1 strain.

      Strengths:

      Having these two KO chicken strains as a tool is a great achievement. The initial analysis is solid. Clear effect of IFNAR deficiency in in vivo infection, less so for IFNLR deficiency.

      Weaknesses:

      (1) The antibody induction by KLH immunisation: No data indicated whether or not this vaccination induces IFN responses in wt mice, so the effects observed may be due to steady-state differences or to differential effects of IFN induced during the vaccination phase. No pre-immune results are shown. The differences are relatively small and often found at only one plasma dilution - the whole of Figure 4 could be condensed into one or two panels by proper calculation of Ab titers - would these titres be significantly different? This, as all of the other in vivo experiments, has not been repeated, if I understand the methods section correctly.

      We thank the reviewer for the valuable comments and helpful suggestions.

      Regarding interferon induction by KLH immunisation, we agree that KLH is not known to strongly induce type I or type III interferon responses. Importantly, the goal of this experiment was not to quantify IFN induction per se, but to assess how the absence of IFN receptors affects adaptive antibody responses under standard immunisation conditions. KLH is a highly immunogenic, copper‑containing extracellular oxygen‑carrier protein derived from the marine gastropod Megathura crenulata and is widely used as a T cell–dependent model antigen to study B‑cell activation, antibody production, and class switching in vivo (Harris & Markl, Micron 1999, doi: 10.1016/s0968-4328(99)00036-0; Schusser et al., 2016, doi: 10.1002/eji.201546171). Because chickens are extremely unlikely to encounter KLH under natural conditions, KLH behaves as a neo‑antigen, and anti‑KLH antibodies can be considered to arise from de novo adaptive responses rather than pre‑existing antigen experience. Owing to its structural complexity and unusual glycosylation, KLH provides broad antigenic stimulation and engages adaptive immune mechanisms largely independently of pathogen‑specific innate pattern recognition, while still supporting robust T helper cell responses (Swaminathan et al., 2014, doi: 10.1111/bcp.12422; Geyer et al., 2004, doi: 10.1016/j.micron.2003.10.033). This makes KLH particularly suitable for dissecting intrinsic differences in adaptive immune responses between genotypes.

      We have now included pre-immune plasma controls (Figure 4 c, d), demonstrating that baseline antibody levels did not differ statistically between groups and were negligible prior to immunisation.

      As for the use of different plasma dilutions, this was necessary to ensure that all samples were measured within the linear detection range of our in-house ELISA. For example, after the primary immunisation, IgY concentrations were relatively low (e.g., day 5 post-immunisation), and plasma samples had to be diluted only 1:100 to detect measurable differences between groups. In contrast, after the booster immunisation, IgY concentrations increased substantially, and lower dilutions such as 1:100 led to signal saturation. Therefore, higher dilutions (up to 1:1600) were required to keep the values within the measurable range.

      Following the reviewer’s recommendation, we have now unified the presentation of results by showing data at a single representative dilution for each isotype: 1:100 for IgM (Figure 4C) and 1:1600 for IgY (Figure 4D). These dilutions fall within the linear part of the standard curve to distinguish between groups. We also calculated endpoint antibody titers, which confirmed that the observed differences remain statistically significant (p < 0.05).
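      The endpoint titer calculation mentioned above can be illustrated with a minimal sketch. It assumes one common convention, which the response does not confirm was the one used: the cutoff is the mean of negative-control ODs plus three standard deviations, and the titer is the highest dilution factor whose OD still exceeds that cutoff. The function name, dilution series, and OD values below are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

def endpoint_titer(dilution_factors, od_values, negative_ods, n_sd=3.0):
    """Return the highest dilution factor whose OD exceeds the cutoff.

    Assumed convention: cutoff = mean(negative controls) + n_sd * SD.
    Returns None when no dilution is above the cutoff.
    """
    cutoff = mean(negative_ods) + n_sd * stdev(negative_ods)
    positive = [d for d, od in zip(dilution_factors, od_values) if od > cutoff]
    return max(positive) if positive else None

# Hypothetical two-fold dilution series (1:100 to 1:3200) for one plasma sample.
dilutions = [100, 200, 400, 800, 1600, 3200]
sample_od = [2.10, 1.60, 1.05, 0.62, 0.31, 0.10]
negative_od = [0.08, 0.10, 0.09]  # hypothetical pre-immune control readings

titer = endpoint_titer(dilutions, sample_od, negative_od)
print(titer)
```

      Reporting one titer per animal in this way collapses the per-dilution panels into a single value per time point, which is the condensation the reviewer asked for; group titers can then be compared with whichever statistical test suits the study design.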

      Regarding experimental replication, the study design already incorporated sufficient biological replication and longitudinal sampling to ensure robustness of the findings. Each experimental group consisted of ten animals, including three animals that served as negative controls. In addition, animals were sampled at multiple time points following immunisation, allowing the dynamics of the antibody response to be monitored over time. This longitudinal design provides repeated biological measurements within the same experimental cohort and allows confirmation of consistent response patterns across time points. All ELISA measurements were performed in technical triplicates. Together, the combination of adequate group size, appropriate controls, repeated sampling over time, and technical replication provides sufficient statistical power and internal validation of the observed effects. Furthermore, all animal experiments were conducted under strict approval of the Government of Upper Bavaria and in accordance with German animal welfare regulations, which limit unnecessary repetition of in vivo experiments beyond the approved experimental design.

      (2) The basic conundrum here and in later figures is never addressed by the authors: Situations where IFN type 1 and 3 signalling deficiency each have an independent effect (i.e., Figure 4d) suggest that they act by separate, unrelated mechanisms. However, all the literature about these IFN families suggests that they show almost identical signalling and gene induction downstream of their respective receptors. How can the same signalling, clearly active here downstream of the receptors for IFN type 1 or type 3, be non-redundant, i.e., why does the unaffected IFN family not stand in? This is a major difference from the mouse studies, which showed a rather subtle phenotype when only one of the two IFN systems was missing, but a massive reduction in virus control in double KO mice (the correct primary paper should be quoted here, not only the review by McNab). Reasons could be a direct effect of IFNab on B cells and an indirect effect of IFNL through non-B cells, timing issues, and many other scenarios can be envisaged. The authors do not address this question, which limits the depth of analysis.

      We thank the reviewer for this insightful comment. Indeed, this represents one of the most interesting and novel findings of our study. Unlike in mice, where both type I and type III interferon systems need to be disrupted to observe clear susceptibility to influenza infection, in our chicken model the loss of IFNAR1 alone was sufficient to render the animals highly susceptible. This highlights a key difference between mammalian and avian interferon biology and supports the main goal of our work, to investigate the specific biological activities of avian interferons rather than directly transferring conclusions from mammalian systems.

      In relation to Figure 4d (anti-KLH IgY), we observed that both IFNAR1<sup>-/-</sup> and IFNLR1<sup>-/-</sup> animals showed reduced IgY levels compared to wild type at day 3 after the booster immunisation. However, by day 5 post-booster, IgY levels in IFNLR1<sup>-/-</sup> animals had returned to wild-type levels, while IFNAR1<sup>-/-</sup> animals still showed significantly lower IgY. This indicates that type III IFN contributes to the early phase of the IgY response but that its absence can later be compensated by type I IFN signalling. In contrast, loss of type I IFN cannot be compensated by type III IFN, suggesting that type I IFN plays a more dominant or sustained role in antibody induction.

      Although type I and type III IFNs share overlapping signaling pathways and induce similar sets of ISGs, their effects are not entirely redundant in chickens. A likely explanation is the difference in receptor distribution: IFNAR1 is broadly expressed across most cell types, while IFNLR1 expression is mainly confined to epithelial cells (Reuter et al. 2014, doi: 10.1128/jvi.02764-13; Santhakumar et al., 2017, doi: 10.3389/fimmu.2017.00049). This systemic versus localized receptor pattern likely determines the range of responsive cells and may account for the differential outcomes observed when either receptor is absent.

      Taken together, our findings indicate that while type I and type III IFNs share overlapping signaling mechanisms, they maintain distinct biological functions in chickens, consistent with their differing receptor expression and cellular responsiveness. This contrasts with mammalian models, where redundancy between these systems is more apparent and only double knockouts show strong phenotypes, especially during influenza infection (Mordstein et al., 2008, doi: 10.1371/journal.ppat.1000151; Mordstein et al., 2010, doi: 10.1128/jvi.00272-10). We have now cited these primary studies instead of the McNab review and expanded the Discussion to reflect this interpretation (Page 10, Line 463-467).

      (3) In the one in vivo experiment performed with chickens, only one virus was tested; more influenza strains should be included, as well as non-influenza viruses.

      We thank the reviewer for this valuable suggestion. The main objective of the present study was to generate and characterize novel chicken models lacking type I and type III interferon receptors in order to investigate their physiological relevance and to obtain the first insights into their roles during viral infection with more emphasis on avian influenza. As part of this manuscript, we performed detailed in ovo experiments using both influenza and non-influenza viruses (Figure 6). These included three influenza strains: H1N1, a mammalian-adapted strain; H3N1, a low pathogenic avian strain showing features of high pathogenicity; and H9N2, a low pathogenic avian strain, as well as a non-influenza virus, the infectious bronchitis virus (IBV). The in ovo analyses revealed clear strain-dependent modulation of interferon responses, and have provided a comprehensive overview of virus-specific interferon activity in chickens. The subsequent in vivo experiment was therefore designed as a proof of concept using the most suitable viral strain to robustly challenge the immune system and to identify the distinct functions of chicken interferons.

      (4) The basic conundrum of point 2 applies equally to Figure 6a; both KOs have a phenotype. Again in 6d, both IFNs appear to be separately required for Mx induction. An explanation is needed.

      We thank the reviewer for raising this important point. We have revised the Discussion (page 10, lines 442-454) and provided supporting references to clarify how the composition of the chorioallantoic membrane (CAM) and virus tropism together determine the apparent requirement for type I and type III interferons. The CAM contains both epithelial and mesodermal–vascular layers, which support complementary interferon functions: type I IFN acts mainly in systemic and vascular compartments, while type III IFN provides localized protection at the epithelial surface. Consequently, viruses that replicate in both compartments (e.g., WSN33, H3N1) require both IFN pathways for maximal Mx induction (Figures 6a, 6d), whereas viruses with a predominant or prolonged epithelial phase (e.g., H9N2, IBV) at the time point analyzed are effectively controlled by type I IFN signaling alone.

      These differences likely reflect virus-specific factors, including cell tropism, replication kinetics, and the spatial–temporal dynamics of receptor expression and signaling. Notably, our measurement of Mx expression at 24 hours post infection (hpi) may represent a phase when type I IFN signaling is dominant and can compensate for the absence of type III IFN. It remains possible that IFN-λ plays a more critical, non-redundant role at earlier stages post infection, when rapid antiviral protection is first required at the epithelial surface. Thus, the apparent redundancy observed at 24 hpi likely reflects temporal compensation and crosstalk between the IFN pathways rather than a lack of biological relevance for type III IFN.

      (5) Line 308, where are the viral titers you refer to in the text? The statement that the results demonstrate that excessive IFNab has a negative impact is overstretched, as no IFN measurements of the infected embryos are shown here.

      We thank the reviewer for this comment and would like to clarify that measurements of type I IFN (IFN-α/β) concentrations were indeed performed. The data are presented in Figure 6b and cited in the Results section (“Knockout of IFNAR1 and IFNLR1 did not affect IFN-α/β secretion in ovo”). To avoid misunderstanding, the Results section has been revised to explicitly reference the IFN-α/β measurements supporting this conclusion (lines 302-309).

      These data indicate that all genotypes produced comparable IFN-α/β levels upon viral infection, with IBV infection inducing approximately tenfold higher IFN-α/β secretion than the influenza strains tested (Figure 6b). The interpretation that an excessive type I IFN response can negatively affect host fitness is based on the combination of quantified IFN-α/β data (Figure 6b) and survival probability results (Supplementary Figure 10), where embryos exhibiting the highest IFN-α/β levels (embryos of all genotypes infected with IBV and IFNLR1<sup>-/-</sup> embryos infected with H9N2) showed the poorest survival despite moderate or low viral titers.

      (6) The in vivo infection is the most interesting experiment, and the key outcome here is that IFN type 1 is crucial for anti-H3N1 protection in chickens, while type 3 is less impactful. However, this experiment suffers from the different time points when chickens were culled, so many parameters are impossible to compare (e.g., weight loss, histopathology, IFN measurements, and more). Many of these phenomena are highly dynamic in acute virus infections, so disparate time points do not allow a meaningful comparison between different genotypes. What are the stats in 7b? Is the median rather than the mean indicated by the line? Otherwise, the lines appear in surprising places. SD must be shown, and I find it difficult to believe that there is a significant difference in weight, for e.g., IFNAR KO, unless maybe with a paired t test. What is the statistical test?

      We thank the reviewer for these thoughtful comments and agree that disease progression and sampling time can influence comparisons in acute infection studies. Hens were euthanized upon reaching predefined humane endpoint scores in full compliance with the Bavarian animal welfare regulations. Because the infection produced markedly different clinical kinetics among genotypes, all data were interpreted with reference to matched disease stages rather than absolute days post-infection.

      For matched comparisons: Viral titers in the trachea and cloaca, as well as plasma IFN-α/β concentrations, were compared between day 2 in IFNAR1<sup>-/-</sup> hens and day 3 in WT and IFNLR1<sup>-/-</sup> hens, which represent equivalent clinical stages before the sharp viral rise seen later in WT and IFNLR1<sup>-/-</sup> birds. At these comparable stages, viral titers were still low and IFN-α/β concentrations remained significantly lower in WT and IFNLR1<sup>-/-</sup> than in IFNAR1<sup>-/-</sup> hens (Figure 7c, d, f), indicating that uncontrolled viral replication and IFN-α/β secretion in the absence of type I signaling occur earlier and more intensely.

      For Figure 7b: Because chickens reached humane endpoints at different days post infection (2 dpi for IFNAR1<sup>-/-</sup> and 5–7 dpi for WT and IFNLR1<sup>-/-</sup>), statistical comparisons were performed within each genotype using paired t-tests and all data were shown together as mean ± SD.

      We acknowledge that unequal survival times limit direct temporal comparison. However, the consistent pattern across all parameters, including early severe disease, high viral load, and excessive IFN-α/β secretion in IFNAR1<sup>-/-</sup> hens versus delayed onset in WT and IFNLR1<sup>-/-</sup> hens, supports the conclusion that type I IFN signaling is essential for early viral restriction and host survival, while type III IFN contributes mainly to localized inflammatory responses. The experiment cannot be repeated under the current animal welfare authorization.

      (7) Figures 7e,f: these comparisons are very difficult to interpret as the virus loads at these time points already differ significantly, so any difference could be secondary to virus load differences.

      We thank the reviewer for this valuable comment. We agree that viral load can influence interferon induction; however, our comparisons in Figures 7e and 7f were designed to reflect equivalent stages of disease progression rather than identical time points post-infection. For IFN-λ mRNA expression (Fig. 7e), spleens from IFNAR1<sup>-/-</sup> hens were sampled on day 2 post-infection, when viral titers were maximal, and compared to WT and IFNLR1<sup>-/-</sup> hens sampled on day 5 post-infection, at which point viral titers reached comparable levels. Thus, this comparison represents the phase of peak infection and systemic immune activation across all genotypes rather than an absolute temporal comparison.

      Similarly, for IFN-α/β concentrations (Fig. 7f), two levels of comparison were made: between IFNAR1<sup>-/-</sup> hens at day 2 post-infection (high viral titer) and WT and IFNLR1<sup>-/-</sup> hens at day 3 (low viral titer), and between WT and IFNLR1<sup>-/-</sup> hens at day 5 post-infection (high viral titer). In both cases, IFN-α/β levels remained disproportionately elevated in IFNAR1<sup>-/-</sup> hens, indicating that the excessive type I IFN response is primarily due to the loss of receptor-mediated feedback regulation rather than viral load alone.

      We have clarified this rationale in the legend of Figure 7 and in the Results (lines 338-345). We believe these results are valuable as they provide important insight into the temporal dynamics and regulatory interplay between type I and type III interferons during avian influenza infection.

      Reviewer #2 (Recommendations for the authors):

      Experiments need to be repeated. Comparisons in infection experiments must be done on the same day. More viruses need to be tested.

      We thank the reviewer for these constructive recommendations. All infection experiments were conducted under approved animal welfare regulations, which limited the number of replicates and prevented repeating in vivo challenges beyond the authorized design, in line with the 3R principles, particularly Reduction, to avoid unnecessary animal use. To ensure comparability, samples were analyzed at matched disease stages rather than identical time points, as clarified in the revised legend of Figure 7 and in the Results (lines 338-345). The study already includes multiple influenza and non-influenza viruses (H1N1, H3N1, H9N2, and IBV) tested in ovo to capture virus-specific interferon responses, while the in vivo H3N1 infection served as a proof of concept to dissect genotype-specific immune dynamics.

    1. eLife Assessment

      This important study suggests that changes in cell regulation may contribute to the evolution of multicellularity. The evidence supporting the conclusions is convincing, with rigorous methods used to test alternative hypotheses. The work will be of broad interest to cell and evolutionary biologists and those studying the cell cycle and cancer.

    2. Reviewer #1 (Public review):

      Summary:

      Ducrocq et al. present research exploring the genetic link between simple multicellular group formation (ace2Δ/ace2Δ) and its interaction with cell-cycle progression mutants (e.g., cln3Δ/cln3Δ), demonstrating that this combination can provide fitness benefits during fluctuating resource conditions, resulting in a rapid increase in the fraction of multicellular cell-cycle mutants over unicellular yeast without selection for multicellular size. Because both the multicellular phenotype and the regulatory link enabling faster escape from the stationary phase are controlled by the ACE2 transcription factor, this work demonstrates that multicellular cluster formation can arise as a side effect of a completely independent fitness advantage unrelated to the benefits of group formation itself. As a "passenger phenotype," multicellularity could thus emerge for other selective reasons, potentially facilitating a later transition to more entrenched multicellularity if novel conditions arise that make multicellular group formation directly beneficial.

      Importantly, while the literature generally assumes that multicellular group formation incurs a cell-level fitness cost, this work demonstrates that certain genetic-environmental interactions can confer fitness benefits even at the level of individual cells forming multicellular groups. This finding should inspire both theoretical and empirical work exploring multicellular group formation selected for benefits at the level of individual cells, rather than the benefits of forming a larger organismal size that most work has relied on so far.

      Strengths:

      This work is novel and exciting for research exploring the very first steps of the transition from unicellularity to simple multicellularity. The formation of multicellular groups is almost always assumed to come at a cell-level fitness cost due to reduced reproductive fitness compared to remaining unicellular, which generally needs to be outweighed by the benefits of multicellular group formation (e.g., large size to escape predation) for the multicellular phenotype to be stable. However, this study presents an interesting case of a genetic and environmental condition under which individual cells forming simple multicellular clusters can actually have higher reproductive fitness than solitary living yeast cells. This contrasts with previous snowflake yeast studies where the multicellular phenotype was primarily beneficial due to strong selection for large groups (rather than cell-level fitness gains).

      The claims and interpretation of the results align well with the data presented. This is due to the careful and straightforward experimental design testing predictions with a clear, stepwise methodology. The authors rule out alternative explanations and provide support for the proposed link between the mutations (ace2, cln3, and others), their impact on faster exit from quiescence and earlier entry into reproduction in fresh media, and the resulting higher fitness in the snowflake yeast phenotype compared to unicellular yeast.

      This experimental framework (combining cell-cycle mutants under the same multicellular background) is very much likely to be adopted by others in the community to explore downstream implications of these results in laboratory and environmental yeast isolates.

      Weaknesses:

      The authors show that the same multicellular phenotype with higher cell-level fitness due to faster exit from the stationary phase can also be observed with alleles found at other loci in non-laboratory yeast strains, implying that the results are likely not specific to a peculiar case genetically engineered in laboratory strains, but that similar phenotypes may be present in nature. However, this remains to be explored by examining the natural ecology of commercially available or wild yeast isolates and their genomes. This is not a weakness of this study per se, but rather a direction for future work. It does mean, however, that the relevance of these findings for early multicellularity in yeast, and even more so for nascent multicellularity in distinct taxa, remains to be explored in the future. Until then, it is difficult to make strong claims about how applicable these results would be for non-laboratory yeast and other taxa. Regardless, this work represents a very exciting finding.

      Comments on revised version:

      The authors addressed all concerns thoroughly.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Ducrocq et al. present research exploring the genetic link between simple multicellular group formation (ace2Δ/ace2Δ) and its interaction with cell-cycle progression mutants (e.g., cln3Δ/cln3Δ), demonstrating that this combination can provide fitness benefits during fluctuating resource conditions, resulting in a rapid increase in the fraction of multicellular cell-cycle mutants over unicellular yeast without selection for multicellular size. Because both the multicellular phenotype and the regulatory link enabling faster escape from the stationary phase are controlled by the Ace2 transcription factor, this work demonstrates that multicellularity can arise as a side-effect of a completely independent fitness advantage unrelated to the benefits of group formation itself. As a "passenger phenotype," multicellularity could thus emerge for other selective reasons, potentially facilitating a later transition to more entrenched multicellularity if novel conditions arise where group formation becomes directly beneficial.

      Strengths:

      This work is novel and exciting for research exploring the very first steps of the transition from unicellularity to simple multicellularity. This is particularly significant because the formation of multicellular groups is almost always assumed to come at a cell-level fitness cost due to reduced reproductive fitness compared to remaining unicellular. This cell-level fitness cost generally needs to be outweighed by the benefits of multicellular group formation (e.g., large size escaping predation) for the multicellular phenotype to be stable, which is true for a large number of cases studied in the literature, where the multicellular phenotype can only evolve over unicellular competitors under strong selection for multicellular groups. However, this study presents an interesting case of a genetic and environmental condition under which individual cells (forming simple multicellular clusters) can actually have higher reproductive fitness than unicellular yeast. This demonstrates that the assumed cost at the single-cell level does not always apply. In summary, this work represents a unique example contrary to common assumptions regarding the costs of multicellular phenotypes, showing that simple multicellular phenotypes can evolve and remain stable without requiring strong selection for multicellular size or other benefits of group formation.

      The claims and interpretation of the results align well with the data presented. This is due to the careful and straightforward experimental design testing predictions with a clear, stepwise methodology, ruling out alternative explanations and providing support for the proposed link between the mutations (ace2, cln3, and others), their impact on faster exit from quiescence, and thus earlier entry into reproduction in fresh media, resulting in higher fitness in the snowflake yeast phenotype compared to unicellular yeast.

      Weaknesses:

      The authors show that the same multicellular phenotype with higher cell-level fitness due to faster exit from the stationary phase can also be observed with alleles found at other loci in non-laboratory yeast strains, implying that the results are likely not specific to a peculiar case genetically engineered in laboratory strains, but that similar phenotypes may be present in nature. However, this remains to be explored further by examining the natural ecology of commercially available or wild yeast isolates and their genomes. This is by no means a weakness of this study and, therefore, not necessarily something the current work can improve. It does mean, however, that the relevance of these findings for early multicellularity in yeast, and even more so for nascent multicellularity in distinct taxa, remains to be explored in the future. Until then, it is difficult to make strong claims about how applicable these results would be for non-laboratory yeast and other taxa. Regardless, this work does its part by representing a very exciting finding.

      Reviewer #2 (Public review):

      Summary:

      Here, the authors attempt to demonstrate that a simple model of multicellularity - snowflake yeast - exhibits key ecologically relevant changes in the regulation of the cell cycle. By examining the effects of the ace2 mutation in environments where multicellularity is not directly selected for or against, and combining it with mutations in key cell cycle regulators, they hope to show that mutations driving simple multicellularity can be selectively favored due to their effects on the release from quiescence rather than their effects on multicellularity itself.

      Strengths:

      The experiments performed are extensive and thorough. The yeast genotypes examined are judiciously chosen, so as to map out a functional model of the relationship between alterations to cell cycle control and changes to multicellularity phenotypes. Multiple possible interactions are examined, with the causal link and model of the relationship between the multicellular passenger phenotype and the selectable quiescence-release phenotype being well-supported. There are extensive controls demonstrating the separation between the 'passenger' multicellular phenotype and the cell cycle regulation phenotypes examined, including haploid/diploid strains with different multicellular phenotypes but similar cell cycle regulation phenotypes, and phenocopy strains in which downstream enzymes are deleted rather than key central regulators.

      Weaknesses:

      My only concerns about these results relate to the focus on selection on cell cycle control being examined in a model of multicellularity with key core cell cycle mutations rather than in a wild-type background, as this is a somewhat artificial system.

      I believe, however, that the authors convincingly make their case that this work on the multicellular phenotypes of yeast represents a potent proof-of-concept that simple multicellularity can be driven into existence or selected for as a passenger phenotype due to pleiotropic effects of mutations under selection from real-world ecological pressures. They are able to connect this phenotype back to known mutations of particular cell cycle regulators (RB) in other multicellular lineages and demonstrate that ecologically relevant changes to the cell cycle are connected to multicellular phenotypes. As a proof of concept of the connection between these phenotypes, rather than a study of a particular event in the past of a living lineage, it makes a strong case.

      A longstanding question in the field of multicellularity is the selective pressures that can drive simple multicellularity into existence and then act on simple multicells to drive their increased size and complexity. This work brings to the table tangible evidence of the possibility that, instead of being selected for on its own, simple multicellularity can be a side-effect of selection on other key phenotypes.

      This separates the question of the origins of multicellularity and the forces that drive its further evolution. This separation can reframe how the field is studied, especially in the context of the apparent dichotomy between dozens of origins of 'simple' multicellularity across the tree of life and a few origins of 'complex' multicellularity in the history of Earth. Especially in light of other evidence that multicellularity is connected to changes in cell cycle regulation, I believe that this is an important insight that will alter the way we think about the origins of this key evolutionary transition.

      We thank the reviewers for their insightful comments on our work.

      We agree with reviewer #1 that further experiments would be needed to determine how the observations made on lab strains apply to yeast in various ecological conditions, particularly in the wild. Here we provide a proof of principle that selection of multicellularity can arise as a side effect. This obviously does not prove that it took place during yeast evolution, but we would like to emphasize that resource fluctuations are very common in ecological conditions, making it highly likely that the environmental conditions necessary for the selection of the side effects described have arisen.

      We agree with reviewer #2 that our work on yeast strains is “somewhat artificial”, as is often the case with model organisms under laboratory conditions. Importantly, though, we showed that the effect found with the cln3 knock-out mutation can be phenocopied by overexpression of WHI5 (encoding the yeast equivalent of Rb). We propose that variations in the levels of cell cycle regulators during evolution may have played a role in multicellularity selection as a side effect. We agree that this is merely a hypothesis to explain the selection of multicellularity (just like predator escape) and that there is no direct evidence that this occurred in the history of the lineage. Nevertheless, our work provides initial evidence that such selection of multicellularity as a side effect is possible, and gives a framework for understanding how multicellularity can persist in the wild, even when it is not the primary target of selection.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      As mentioned in my public review, I very much appreciate this work, its interpretation for early multicellularity as an example opposite to the assumed cost of multicellular phenotypes, and the robust design behind the premise and claims. Therefore, my suggestions below are mostly aimed at improving the readability and data presentation.

      (1) In the abstract, Lines 24-27 (the last sentence): This statement is worded too generally and therefore reads as too strong. I think the authors' work provides an example that multicellularity itself does not need to be beneficial all the time - this is really exciting and makes sense! However, there is a substantial body of work showing the origin and maintenance of multicellularity for its direct benefits. Relative to that body of work, this represents a special case, and therefore, while we should definitely reconsider the view that "multicellularity always comes at a cell-level fitness cost," we cannot overgeneralize these findings. Please consider reframing this statement.

      Done, now line 25 (addition of “in some cases”)

      (2) Line 48 (Introduction): "This mostly concerns two major regulators, RB and Cyclin D." Which organisms are you referring to? Please specify.

      Done.

      (3) In the Introduction, there are at least three sentences that need citations: L57-58, L59-60, and L65. For instance, I do not know what makes CLN3 the yeast functional equivalent of RB, and I wanted to verify this claim, but no references are cited. Please ensure citations are provided throughout the manuscript.

      Done: refs 11, 12, and 13 were added.

      (4) This is my main request regarding data collection and presentation. The authors share some microscopy images of mutant strains in Figure 2 for different purposes (e.g., Figure 2B compares the fraction of budded cells between two genotypes). However, I would appreciate seeing a collected microscopy figure showcasing the phenotypes of all genotypes that went into competition experiments, including the planktonic (WT lab strain) yeast, either where they appear or in a supplementary figure, all presented with the same magnification and scale to make them comparable. Because cell size, shape, and multicellular phenotype are all key aspects of the competition experiments, being able to see all those genotypes/phenotypes would prepare the reader to make predictions about the fitness assays and other experiments.

      Done: Supplementary Figures 1B-E were added.

      (5) Related to my previous point, I would appreciate seeing cell size measurements for the different genotypes (both single cells of planktonic genotypes and single cells forming multicellular clusters). Cell size is a key trait that directly impacts the results shown in the paper, and summary statistics comparing them would be helpful for interpreting the results.

      Done: Supplementary Figure 1F was added.

      (6) In competition experiments, the authors mix unicellular and multicellular yeast clusters at 50/50 and measure the fraction of a phenotype of interest (usually the % of snowflake). It took me a while to understand what is being counted under the "% snowflake yeast" category. This is because, while each cell in unicellular yeast should be counted as one unit, one can count a snowflake yeast composed of 50 cells as 50 units or as 1 unit. Please clearly state what is being counted for the Y-axis labeled "% of snowflake yeast" (or relabel those Y-axes in plots to make this clear).

      Done: added to the legend of Figure 1A and to the Y-axes of the competition figures.

      (7) I recommend editing the genotype labels in figures (see, for instance, Figure 1B, C, D). In Figure 1B, the bars are labeled as "CLN3/CLN3 co-culture" or "cln3Δ/cln3Δ co-culture," etc. These are actually co-cultures of SF vs. PK (with or without a CLN3 copy). Please consider using more representative labels that will be easier for readers to understand.

      Done: this has been changed in all relevant figures.

      (8) In the Results, L225, you begin referring to AMN1368D as AMN1. I suggest using the full allelic form throughout the text so it will be clear each time that you are referring to that specific allele, as I was confused about whether you were discussing the allele or the gene AMN1 itself.

      This has been changed throughout the text.

      (9) Discussion, Lines 250-252, states that this is a "situation that is likely to happen very often under ecological conditions." Are there any examples you can cite?

      Done, as also requested by reviewer #2 (now lines 256-257).

      (10) Lines 272-275 contain a strong, general statement suggesting that co-evolution of cell cycle regulation and multicellularity could be more general (which is acceptable as speculation). However, the suggestion that this co-evolution could have "started very early in the evolution of eukaryotic cells" is too speculative. I would recommend sticking with the alternative, suggesting that the link between the two phenotypes may be a case of convergent evolution.

      Done

      (11) Lines 278-279 are both vague and too bold. The text mentions a link between cancer and multicellularity and then extends this link through cell cycle regulators. Without explaining the connection between cancer and multicellularity and then trying to link it to cell cycle regulators, all in a few words without background, this sentence is too vague. Please consider deleting this or spending more time clearly explaining the link, which would at best still be speculative.

      These speculative sentences were removed.

      (12) First, I wanted to note that I highlighted Lines 284-287, as this passage is clearly written and provides a nice argument. I also wonder if you could mention that your work shows simple multicellular cluster formation should not always come at a cost, contrary to the general assumption in the literature, and add a few citations to support that claim. This would highlight how significant this work is within the broader multicellularity literature.

      Changed in the Discussion (now lines 242-244, with additional references 30 and 31).

      (13) I recommend labeling the genotype of your "quintuple mutant" in Figure 3. You can refer to it as the quintuple mutant in the text, but I had to go back and forth to see what those mutations were when trying to think about potential genetic interactions. Even the legend of Figure 3 does not specify the genotype and refers to it only as the "quintuple mutant."

      Now explicitly stated in the title of the figure

      Reviewer #2 (Recommendations for the authors):

      I find the presented research to be of high quality, with very important implications. I have suggestions for improvement of the manuscript, but they are largely stylistic, with one paper that I believe deserves citation regarding the proteins involved. I see little need for additional experiments or analysis, just a clearer description of the results and their significance.

      (1) Line 62: Yeast CLN3 definitely performs the same role as cyclin D in the cell cycle, but has an unclear phylogenetic relationship with the rest of the cyclins. See Cross, Buchler, & Skotheim 2011 ("Evolution of networks and sequences in eukaryotic cell cycle control"). This reference also covers the functional relationship between RB and Whi5, referred to in nearby sentences, as does Medina, Walsh, and Buchler 2019 ("Evolutionary innovation, fungal cell biology, and the lateral gene transfer of a viral KilA-N domain").

      The reference has been added

      (2) Line 69: Is the question whether the evolution of G1/S regulation favors multicellularity, or whether the two are connected such that the evolution of one can affect the other?

      It is clearly the first of the two questions.

      (3) Line 73: Comma after Ace2.

      Done

      (4) Line 76: It would be clearer to specify that snowflake and ACE2 yeast were co-cultured without settling selection or other selection that explicitly favors multicellularity, unlike in experiments where multicellular evolution is observed, as in Ratcliff publications.

      This is now specified.

      (5) Line 80: Specify which phenotypes are observed for ace2 mutants, specifically both the multicellularity and the release from quiescence.

      Done

      (6) Line 146: This observation should be noted as another indication that the multicellular phenotype is not behind the selective pressure, because it is so different between unicells and multicells.

      Overall, you have very strong evidence that this is the case, and emphasizing this would benefit the paper!

      Done.

      (7) Line 151: specify that you are maintaining yeast in proliferation in coculture.

      Done.

      (8) Line 181: This is another key experiment showing that the multicellular phenotype is not the causal reason for the change in quiescence. It might make things clearer to bring all these confirmatory experiments together, particularly the haploids and the sonicated single cells.

      This is now clearly stated at line 195.

      (9) Line 225: The choice of referring to the non-laboratory strain as the 'AMN1' wild type default may be confusing to readers, who may treat the genetic background you are using as the ground truth wild type. I recommend throughout the paper always specifying the allele's amino acid to avoid any confusion.

      The genotype is now clearly presented throughout the text.

      (10) Line 238: I would continue to specify that the multicellular phenotype has no selective advantage, specifically when no selection for size is applied.

      See added sentence Line 242-4 (revised version)

      (11) Line 243: I would say that the evolution of cell cycle regulation may interact with the multicellular phenotype.

      This was changed (now line 248)

      (12) Line 244: Strike 'indeed' and the 'the' before AMN1 and ACE2.

      Done

      (13) Line 252: Suggest some ecological conditions under which quiescence exit is likely, such as boom and bust or moving from rotting fruit to rotting fruit.

      Done

      (14) Line 267: Are you suggesting that the specific genes AMN1 and ACE2 had particular effects on actual organisms in the past, or that it represents a broad pattern of evolution in which multicellularity could be more broadly related to exit from quiescence? I believe it is the latter, and I think that should be clearer.

      Modified as suggested

      (15) Line 280: In this paragraph, I think that the point being made could be slightly clearer - if I am not mistaken, you are making the distinction between the appearance of multicellularity and its refinement under selection, and that the former may be more common than previously believed, given this proof of concept. I think this can be made clearer. Furthermore, it is worth noting that all experiments that show effects of the multicellular phenotype are in mutant backgrounds, and explaining why this is still relevant to wild organisms. It might be taken by some as indicating that the multicellular phenotypes are not relevant to a wild population, but the connection to known RB mutations in known multicellular lineages and the fact that it is connected to a very key aspect of cell cycle regulation, I think, overcomes this issue, and this should be made clear.

      Our study reveals a genetic link between multicellularity and Whi5 and Cln3, two important G1/S cell cycle regulators. Similar genetic interactions have been observed in phylogenetically distant species, reinforcing the idea that the interplay between cell cycle regulation and multicellularity is a general feature and not a mere artifact of mutant background.

      The neutral fitness effect of multicellularity in wild-type backgrounds is particularly of interest. By being maintained as a side effect of selection on fundamental cellular processes, the neutral effect of multicellularity may have provided “an evolutionary scheme” for its repeated emergence throughout the tree of life. As such, the "passenger selection" hypothesis fits well with the observations of phenotypic reversibility and facultative multicellularity, despite varying and specific selective pressures. Our work thus gives a framework to understand how multicellularity can persist in the wild, even when it is not the primary target of selection.

      (16) Line 314: What promoters are they driven by?

      Specified

      (17) Line 336: What was the culture volume, and the volume transferred?

      Specified

      (18) Line 362: How was the proportion of blue-stained cells scored? Manually, or with an imaging software cutoff?

      Specified

      (19) Figure 1: I think that the full genotypes of each strain should be specified, either in the legend or the key of the figure, rather than always specifying the ACE2 genotype and other mutations separately.

      Done as requested by reviewer #1

      (20) Figure 2E, 2F: Same as Figure 1, regarding genotypes.

      Done

    1. eLife Assessment

      This important study demonstrates that paternal diet influences not only testicular morphology but also placental and fetal development, supporting a role for paternal contributions to offspring health. The study also considers potential links between the microbiome and male reproductive health. By combining transcriptomic and histological analyses across multiple tissues, the evidence supporting the central conclusions of the study is convincing.

    2. Reviewer #1 (Public review):

      Summary:

      Morgan et al. studied how paternal dietary alteration influenced testicular phenotype, placental and fetal growth using a mouse model of paternal low protein diet (LPD) or Western Diet (WD) feeding, with or without supplementation of methyl-donors and carriers (MD). They found diet- and sex-specific effects of paternal diet alteration. All experimental diets decreased paternal body weight and the number of spermatogonial stem cells, while fertility was unaffected. WD males (irrespective of MD) showed signs of adiposity and metabolic dysfunction, abnormal seminiferous tubules and dysregulation of testicular genes related to chromatin homeostasis. Conversely, LPD induced abnormalities in the early placental cone, fetal growth restriction and placental insufficiency, which was partly ameliorated by MD. The paternal diets changed placental transcriptome in a sex-specific manner and led to a loss of sexual dimorphism in the placental transcriptome. These data provide a novel insight on how paternal health can affect the outcome of pregnancies, which is often overlooked in prenatal care.

      Strengths:

      The authors have performed a well-designed study using commonly used mouse models of paternal underfeeding (low protein) and overfeeding (Western diet). They performed comprehensive phenotyping at multiple timepoints including of the fathers, the early placenta and late gestation feto-placental unit. The inclusion of both testicular and placental morphological and transcriptomic analysis is a powerful non-biased tool for such exploratory observational studies. The authors describe changes in testicular gene expression revolving around histone (methylation) pathways that are linked to altered offspring development (H3.3 and H3K4), which is in line with hypothesised paternal contributions to offspring health. The authors report sex differences in control placentas that mimic those in humans, providing potential for translatability of the findings. The exploration of sexual dimorphism (often overlooked) and its absence in response to dietary modification is novel and contributes to the evidence-base for the inclusion of both sexes in developmental studies.

      Comments on revised version:

      The authors have done a great job addressing my concerns. The description of the data analysis and the figures are now much clearer. The inclusion of the potential links between the microbiome and male reproductive fitness is informative and improves the flow of the discussion.

    3. Reviewer #2 (Public review):

      Summary:

      The authors investigated the effects of a low-protein diet (LPD) and a high sugar- and fat-rich diet (Western diet, WD) on paternal metabolic and reproductive parameters and feto-placental development and gene expression. They did not observe significant effects on fertility; however, they reported gut microbiota dysbiosis, alterations in testicular morphology, and severe detrimental effects on spermatogenesis. In addition, they examined whether the adverse effects of these diets could be prevented by supplementation with methyl donors. Although LPD and WD showed limited negative effects on paternal reproductive health (with no impairment of reproductive success), the consequences on fetal and placental development were evident and, as reported in many previous studies, were sex-dependent.

      Strengths:

      This study is of high quality and addresses a research question of great global relevance, particularly in light of the growing concern regarding the exponential increase in metabolic disorders, such as obesity and diabetes, worldwide. The work highlights the importance of a balanced paternal diet in regulating the expression of metabolic genes in the offspring at both fetal and placental levels. The identification of genes involved in metabolic pathways that may influence offspring health after birth is highly valuable, strengthening the manuscript and emphasizing the need to further investigate long-term outcomes in adult offspring.

      The histological analyses performed on paternal testes clearly demonstrate diet-induced damage. Moreover, although placental morphometric analyses and detailed histological assessments of the different placental zones did not reveal significant differences between groups, their inclusion is important. These results indicate that even in the absence of overt placental phenotypic changes, placental function may still be altered, with potential consequences for fetal programming.

      Comments on revised version:

      The authors have adequately addressed all my previous comments.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Morgan et al. studied how paternal dietary alteration influenced testicular phenotype, placental and fetal growth using a mouse model of paternal low protein diet (LPD) or Western Diet (WD) feeding, with or without supplementation of methyl-donors and carriers (MD). They found diet- and sex-specific effects of paternal diet alteration. All experimental diets decreased paternal body weight and the number of spermatogonial stem cells, while fertility was unaffected. WD males (irrespective of MD) showed signs of adiposity and metabolic dysfunction, abnormal seminiferous tubules, and dysregulation of testicular genes related to chromatin homeostasis. Conversely, LPD induced abnormalities in the early placental cone, fetal growth restriction, and placental insufficiency, which were partly ameliorated by MD. The paternal diets changed the placental transcriptome in a sex-specific manner and led to a loss of sexual dimorphism in the placental transcriptome. These data provide a novel insight into how paternal health can affect the outcome of pregnancies, which is often overlooked in prenatal care.

      Strengths:

      The authors have performed a well-designed study using commonly used mouse models of paternal underfeeding (low protein) and overfeeding (Western diet). They performed comprehensive phenotyping at multiple timepoints, including the fathers, the early placenta, and the late gestation feto-placental unit. The inclusion of both testicular and placental morphological and transcriptomic analysis is a powerful, non-biased tool for such exploratory observational studies. The authors describe changes in testicular gene expression revolving around histone (methylation) pathways that are linked to altered offspring development (H3.3 and H3K4), which is in line with hypothesised paternal contributions to offspring health. The authors report sex differences in control placentas that mimic those in humans, providing potential for translatability of the findings. The exploration of sexual dimorphism (often overlooked) and its absence in response to dietary modification is novel and contributes to the evidence-base for the inclusion of both sexes in developmental studies.

      Weaknesses:

      The data are overall consistent with the conclusions of the authors. The paternal and pregnancy data are discussed separately, instead of linking the paternal phenotype to offspring outcomes. Some clarifications regarding the methods and the model would improve the interpretation of the findings.

      (1) The authors insufficiently discuss their rationale for studying methyl-donors and carriers as micronutrient supplementation in their mouse model. The impact of the findings would be better disseminated if their role were explained in more detail.

      We acknowledge the Reviewer’s comments regarding the amount of detail in support of the inclusion of methyl carriers and donors within our diet. Therefore, we will revise the manuscript to include more justification, especially within the Introduction section, for their inclusion. Please see lines 111-120.

      (2) It is unclear from the methods exactly how long the male mice were kept on their respective diets at the time of mating and culling. Male mice were kept on the diet between 8 and 24 weeks before mating, which is a large window in which the males undergo a considerable change in body weight (Figure 1A). If males were mated at 8 weeks but phenotyped at 24 weeks, or if there were differences between groups, this complicates the interpretation of the findings and the extrapolation of the paternal phenotype to changes seen in the fetoplacental unit. The same applies to paternal age, which is an important known factor affecting male fertility and offspring outcomes.

      We thank the Reviewer for their comments regarding the ages of the males analysed. As we had 5 treatment groups, and intended to generate a minimum of 8 litters of offspring per treatment group, this resulted in over 40 litters in total. In order to dissect these litters appropriately, and in a timely fashion, we had to stagger their generation over time. As such, this resulted in utilising our males at different ages/durations on the diet. However, in all our statistical analysis, we factored in the duration of time on the diet, which also acted as a proxy measure of paternal age. We also ensured that we staggered the generation of litters in each diet group so that any age effects were experienced across all paternal regimens.

      We have revised the manuscript to acknowledge this fact and to highlight that the duration of time on any diet was factored into the statistical analysis.

      (3) The male mice exhibited lower body weights when fed experimental diets compared to the control diet, even when placed on the hypercaloric Western Diet. As paternal body weight is an important contributor to offspring health, this is an important confounder that needs to be addressed. This may also have translational implications; in humans, consumption of a Western-style diet is often associated with weight gain. The cause of the weight discrepancy is also unaddressed. It is mentioned that the isocaloric LPD was fed ad libitum, while it is unclear whether the WD was also fed ad libitum, or whether males under- or over-ate on each experimental diet.

      We agree with the Reviewer that the general trend towards a lighter body weight for our experimental animals is unexpected. We can confirm that all diets were fed ad libitum. However, as males were group housed, we were unable to measure food consumption for individual males. We also observed that males fed the high fat diets often shredded significant quantities of their diet rather than eating it, which prevented accurate measurement of food intake.

      We also agree with the Reviewer that body weight can be a significant confounder for many paternal and offspring parameters. However, while the experimental males did become lighter, there were no statistical differences between groups in mean body weight. As such, body weight was not included as a variable within our statistical analysis.

      (4) The description and presentation of certain statistical analyses could be improved.

      (i) It is unclear what statistical analysis has been performed on the time-course data in Figure 1A (if any). If one-way ANOVA was performed at each timepoint (as the methods and legend suggest), this is an inaccurate method to analyse time-course data.

      (ii) It is unclear what methods were used to test the relative abundance of microbiome species at the family level (Figure 2L), whether correction was applied for multiple testing, and what the stars represent in the figure.

      (iii) Mentioning whether siblings were used in any analyses would improve transparency, and if so, whether statistical correction needed to be applied to control for confounding by the father.

      We apologise for the lack of clarity regarding the statistical analyses. We will revise the manuscript to include a more detailed description of the different analyses, the inclusion of siblings and the correction for multiple testing.
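
      The multiple-testing question raised in point (ii) can be illustrated with a minimal sketch. The response does not say which correction the authors applied, so the Benjamini-Hochberg procedure below is an assumption chosen only as a common example, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per bacterial family (illustration only).
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

      Reporting the adjusted values alongside the figure legend would tell the reader which threshold the stars refer to.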

      Reviewer #2 (Public review):

      Summary:

      The authors investigated the effects of a low-protein diet (LPD) and a high sugar- and fat-rich diet (Western diet, WD) on paternal metabolic and reproductive parameters and fetoplacental development and gene expression. They did not observe significant effects on fertility; however, they reported gut microbiota dysbiosis, alterations in testicular morphology, and severe detrimental effects on spermatogenesis. In addition, they examined whether the adverse effects of these diets could be prevented by supplementation with methyl donors. Although LPD and WD showed limited negative effects on paternal reproductive health (with no impairment of reproductive success), the consequences on fetal and placental development were evident and, as reported in many previous studies, were sex-dependent.

      Strengths:

      This study is of high quality and addresses a research question of great global relevance, particularly in light of the growing concern regarding the exponential increase in metabolic disorders, such as obesity and diabetes, worldwide. The work highlights the importance of a balanced paternal diet in regulating the expression of metabolic genes in the offspring at both fetal and placental levels. The identification of genes involved in metabolic pathways that may influence offspring health after birth is highly valuable, strengthening the manuscript and emphasizing the need to further investigate long-term outcomes in adult offspring.

      The histological analyses performed on paternal testes clearly demonstrate diet-induced damage. Moreover, although placental morphometric analyses and detailed histological assessments of the different placental zones did not reveal significant differences between groups, their inclusion is important. These results indicate that even in the absence of overt placental phenotypic changes, placental function may still be altered, with potential consequences for fetal programming.

      Weaknesses:

      Overall, this manuscript presents a rich and comprehensive dataset; however, this has resulted in the analysis of paternal gut dysbiosis remaining largely descriptive. While still valuable, this raises questions regarding why supplementation with methyl donors was unable to restore gut microbial balance in animals receiving the modified diets.

      We thank the Reviewer for their considered thoughts on the gut dysbiosis induced in our models and the minimal impact of the methyl donors and carriers. We will include additional text within the Discussion to acknowledge this. However, at this point in time, we are unsure as to why the methyl donors had minimal impact. It could be that the macronutrients (i.e. protein, fat, carbohydrates) have more of an influence on gut bacterial profiles than micronutrients. Alternatively, due to the prolonged nature of our feeding regimens, any initial influences of the methyl donors may become diluted out over time. We will amend the text to reflect these potential factors.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have done an immense amount of work, which should be commended. In addition to the public review, I have a few suggestions for improvement.

      (1) To further explore the weight discrepancy between the males subjected to diet alteration and those on the control diet, further details about the intake and provision of the diets would be beneficial. Seeing as the fat mass was increased in males fed a WD, do you have information on where the weight 'loss' originated from?

      We thank the Reviewer for their insight into the changes in male body weight. We agree that the difference between total body weight and the amount of adipose tissue is intriguing. Unfortunately, we were unable to monitor the food intake of our animals for two main reasons. The first was that, for animal welfare considerations, all our males were initially group housed prior to mating. This meant that typically, males were housed in groups of 4 during the initial feeding (pre-mating) period. Males were only housed singly once they were used for mating. As such, it was not possible to obtain food consumption data for individual males.

      A second limitation arose from the extent to which males fed the Western Diet shredded their diet. This meant that it was not possible to weigh the food to obtain even a crude idea of how much they were consuming. The reason for this shredding is not clear to us. All mice received environmental enrichment, and we did not observe this behaviour in our control or low protein diet fed males.

      With regards to the weight of the other organs, we did not observe any significant overall changes in organ weight, or weight relative to body weight. Unfortunately, we did not have access to, or conduct, any whole body scanning, such as DEXA, which would have given more insight into the body composition of our mice.

      (2) The testicular abnormalities and gene expression findings are linked nicely to the offspring's story. This is not as compelling for other findings, including the gut microbiome changes, which are not discussed in the context of the fetoplacental outcomes. More discussion of the potential impact of paternal changes on fetal outcomes would strengthen claims that these findings are impactful.

      We thank the Reviewer for their comments and suggestion. Our caution with connecting the gut microbiota to offspring development is that, to the best of our understanding, there is little data with regards to its effect on post-fertilisation development. While there is data showing that the microbiome can produce compounds and metabolites that can affect sperm quality and metabolism, lipid composition and testicular morphology, the connection with post-fertilisation development is limited. Additionally, as we saw no difference in fundamental fertility, as measured by changes in litter size, we propose that there were no overall changes in the ability of the sperm from our experimental males to reach, fertilise and support development.

      However, we acknowledge the Reviewer's comments on strengthening the manuscript and so have included some additional text within the Discussion to highlight the links between the microbiome and male reproductive fitness. Please see lines 337-348.

      (3) It is clarified in the methods that n=8 males were used in the study, but different n numbers are shown for some parameters. It would improve transparency for the reader if it were clarified whether these differences result from missing data or from the removal of statistical outliers.

      The Reviewer is correct that while 8 males were initially placed on their respective diets, for some of the analyses, the n-number is less than 8. In some instances, for example the analysis of total body fat (Fig. 1D), data was unfortunately not collected during an initial round of dissections. As such, the n number here is only 6 in each group. Additionally, due to the high cost associated with sequencing the microbiome for 5 groups, we decided to only sequence 6 samples per group. However, we do not feel that this impacts significantly on the overall focus of the data presented.

      (4) Despite this, you may have been underpowered to detect differences in some parameters, for example, the placental stereology. Alternative approaches, such as immunostaining with whole-section quantification, may be more sensitive to detect subtle changes. Alternatively, have you considered using smaller grids for improved sensitivity of the stereological analysis?

      We thank the Reviewer for their insight into the data and their suggestion for immunostaining. We agree with the Reviewer that a greater number of samples would have strengthened our analyses. However, we are not in possession of further samples that have been processed in the correct manner for additional stereological analysis. We are hoping to conduct further placental analyses based on our RNA-Seq data, but this will require the generation of new samples.

      (5) It would be easier to interpret the figures if it were clear which datasets were analysed using non-parametric tests. Are Figures 2F, 2G, 6A, 6E, and 6I shown differently for that reason, perhaps? It would improve transparency if non-normally distributed data are shown as medians, as that's what's being compared in a non-parametric test.

      We apologise for any confusion regarding the analysis of our data. The Reviewer is correct that the data in 2F and 2G were analysed using a non-parametric test. We have now made this clearer in the legend to the figure, highlighting which data sets were analysed by ANOVA or Kruskal–Wallis test. We have also done this for the other figure legends where appropriate. With regard to Figure 6, the data presented in Panels A, E and I were intended to show the range of data extending above and below the 90th and 10th centiles of the CD fetuses. As such, we felt that violin plots were the most appropriate way to display these data.

      (6) Supplemental Figure 1 seems to be missing.

      We apologise sincerely for the omission of Supplemental Figure 1. We will ensure that it is included in our resubmission.

      (7) Line 523 states that samples with RIN < 7 were used for microarray analysis. Do the authors mean RIN > 7?

      We thank the Reviewer for identifying our mistake. The Reviewer is correct that this should have been a RIN >7. We have now corrected this.

      (8) It is mentioned in lines 603-604 that paraffin shrinkage was accounted for. It could be useful to describe how this was done.

      We have revised the text within the Materials and Methods to provide additional clarity on how we compensated for the shrinkage due to the paraffin processing.

      In the revised Methods we have added a brief “Shrinkage correction” subsection describing how paraffin-embedding shrinkage was quantified for each placenta individually. Specifically, we now state that post-embedding placental volume was estimated using the Cavalieri Principle on systematic and uniformly-random sampled H&E sections, and a per-placenta volume shrinkage coefficient (k<sub>V</sub> = V<sub>post</sub>/V<sub>pre</sub>) was calculated.

      We have also added the equations showing how this coefficient was used to correct compartment volumes and the derived surface area estimates (surface area calculated from S<sub>v</sub> and the corresponding shrinkage-corrected placenta volume). Please see lines 618-644.
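
      The shrinkage correction described above can be sketched as follows. The manuscript's actual equations are not reproduced in this response, so this is only an illustration under stated assumptions: the function names and numeric values are hypothetical, the pre-embedding volume is taken as already measured (the response does not say how), and compartment volumes are corrected by dividing by k<sub>V</sub>.

```python
def cavalieri_volume(points_per_section, area_per_point, section_spacing):
    """Cavalieri estimate: V = T * a(p) * sum(P) over the sampled sections."""
    return section_spacing * area_per_point * sum(points_per_section)


def shrinkage_coefficient(v_post, v_pre):
    """Per-placenta volume shrinkage coefficient k_V = V_post / V_pre."""
    if v_pre <= 0:
        raise ValueError("pre-embedding volume must be positive")
    return v_post / v_pre


def correct_volume(v_measured, k_v):
    """Assumed correction: scale a post-embedding volume back by 1 / k_V."""
    return v_measured / k_v


def surface_area(s_v, v_corrected):
    """Surface area from surface density S_v and the corrected volume."""
    return s_v * v_corrected


# Hypothetical values for one placenta (mm, mm^2, mm^3), illustration only.
v_post = cavalieri_volume([12, 18, 20, 15], area_per_point=0.25, section_spacing=1.0)
k_v = shrinkage_coefficient(v_post, v_pre=20.0)
compartment_corrected = correct_volume(8.0, k_v)
compartment_surface = surface_area(2.0, compartment_corrected)
```

      Computing k<sub>V</sub> per placenta, rather than applying one global factor, keeps sample-to-sample differences in processing shrinkage from leaking into the compartment estimates.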

      (9) This may be due to the generation of the reviewer PDF, but Figure 4E and 4H are illegible in our version of the manuscript.

      We apologise for the low resolution of these figures and the difficulty in seeing the information presented. We have created revised versions of these figures which we hope are of higher quality and clarity.

      (10) What do the stars represent in Figure 6A, E, I - compared to what, controls?

      The Reviewer is correct that the asterisks in Figures 6A, E and I represent differences in the proportion of fetuses either above or below the 90th and 10th centile of the CD fetuses respectively. As such, in panel A, for both the LPD and MD-LPD groups, there are significantly more fetuses who are below the 10th centile of the CD group. Similarly, in panel E, there are significantly more placentas in the LPD group that have a weight above the 90th centile of the CD group. We have revised the graphs to make these differences, and their comparisons clearer.
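
      The centile comparison described above can be sketched as follows: the 10th and 90th centiles are taken from the control (CD) group, and each experimental group is summarised by the fraction of its values falling outside them. The helper names, the interpolation rule and the weights are hypothetical, and the statistical test the authors used to compare these proportions is not stated here, so none is shown.

```python
def percentile(values, q):
    """Percentile q (0-100) by linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def fraction_outside(control, treated):
    """Fractions of treated values below the control 10th / above the 90th centile."""
    low = percentile(control, 10)
    high = percentile(control, 90)
    below = sum(1 for v in treated if v < low) / len(treated)
    above = sum(1 for v in treated if v > high) / len(treated)
    return below, above


# Hypothetical fetal weights in grams (illustration only, not study data).
control_weights = [0.90, 0.95, 1.00, 1.02, 1.05, 1.08, 1.10, 1.12, 1.15, 1.20, 1.25]
lpd_weights = [0.80, 0.85, 0.92, 0.98, 1.00, 1.05, 1.10, 1.30]
below, above = fraction_outside(control_weights, lpd_weights)
```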

      Reviewer #2 (Recommendations for the authors):

      Some Recommendations for improving the writing and presentation, and minor corrections to the text and figures:

      (1) Please describe Wnt signaling in the Abstract.

      The Abstract has been amended to provide some additional text regarding Wnt signalling. Please see lines 60-63.

      (2) Page 6, line 134: A brief explanation of why measuring the inhibin beta-A chain should be included.

      The text within this section has been amended to include a brief description of the role of Inhibin β-A chain on testicular function. Please see lines 135-139.

      (3) The methodology used for Tnf determination is missing and should be described.

      We apologise for the lack of detail regarding our analysis of serum Tnf in our males. This has now been included. Please see lines 479-480.

      (4) It is important to mention that free fatty acid levels in the MD-WD group were similar to those in the CD group, although they remained comparable to the WD group.

      We agree with the Reviewer and have amended the text to indicate that there was no difference in the FFA profile of the MD-WD males to either the CD or WD males. Please see lines 147-148.

      (5) Figure 2 presents both metabolic parameters and bacterial profile analyses. Although the authors appear to relate these outcomes, clarity would be improved by presenting them in separate figures.

      As requested, we have now presented these data as two separate Figures.

      (6) Figure 3H: The data suggest that the decrease in the number of spermatogonia (PLZF⁺) observed in the LPD and WD groups was prevented when the diets were supplemented with methyl donors.

      (7) However, the description and interpretation of this result (or of a neutral effect) are missing.

      We agree with the Reviewer in their interpretation of the PLZF+ data. We have indicated this in the text within the Results and Discussion sections. Please see lines 177-178 and lines.

      (8) Line 284: Please check the abbreviation for MD-LPD.

      We thank the Reviewer for identifying this typographical mistake. This has now been corrected to state MD-LPD and not MDL.

      (9) Line 285: Please check the lettering in the text and in Figure 6H-K.

      We thank the Reviewer for identifying this typographical mistake. This has now been corrected to state the panels are Figure 9H-K, as we have split the original Figure 2 into two figures.

    1. eLife Assessment

      The work by van der Pijl presents important findings on the role of titin-associated muscle ankyrin repeat proteins (MARPs) on hypertrophy via mTOR signalling. The study presents rigorous data using in vivo loss-of-function and pharmacological approaches to investigate effects on hypertrophy. While the evidence supporting the role of MARPs on hypertrophy is solid, there are limitations. For example, the use of rapamycin only inhibits some aspects of mTORC1 signalling and the study is limited to analysis of the diaphragm and thus it is not clear if the mechanisms are conserved across other muscle types.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      In this manuscript, the authors employ diaphragm denervation in rats and mice to study titin-based mechanosensing and longitudinal muscle hypertrophy. By integrating bulk RNA-seq, proteomics, and phosphoproteomics, they map the stretch-responsive signalling landscape, uncovering robust induction of the muscle-ankyrin-repeat proteins (MARP1-3) together with enhanced phosphorylation of titin's N2A element.

      Genetic ablation of MARPs in mice amplifies longitudinal fibre growth and is accompanied by activation of the mTOR pathway, whereas systemic rapamycin treatment suppresses the hypertrophic response, highlighting mTORC1 as a key downstream effector of titin/MARP signalling.

      Strengths:

      The authors address a clear biological question: "how titin-associated factors translate mechanical stretch into longitudinal fibre growth" using a unique and clinically relevant animal model of diaphragm denervation. Using a comprehensive multiomics approach, the authors identify MARPs as potential mediators of these effects and use a genetic mouse model to provide compelling evidence supporting causality. Additionally, connecting these findings to rapamycin, a drug widely used clinically, further increases the relevance and potential impact of the study.

    3. Reviewer #2 (Public review):

      Summary:

      Muscle hypertrophy is a major regulator of human health and performance. Here, van der Pijl and colleagues assess the role of the giant elastic protein, titin, in regulating the longitudinal hypertrophy of diaphragm muscles following denervation. Interestingly, the authors find an early hypertrophic response, with 30% new serial sarcomeres added within 6 days, followed by subsequent muscle atrophy. Using RBM20 mutant mice, which express a more compliant titin, the authors discovered that this longitudinal hypertrophy is mediated via titin mechanosensing. Through an omics approach, it is suggested that the muscle ankyrin repeat proteins may regulate this process. Genetic ablation of MARPs 1-3 blocks the hypertrophic response, although single knockouts are more variable, suggesting extensive complementation between these titin binding proteins. Finally, it is found through the administration of rapamycin that the mTOR signalling pathway plays a role in longitudinal hypertrophic growth.

      Strengths:

      This paper is well written and uses an impressive suite of genetic mouse models to address this interesting question of what drives longitudinal muscle growth.

      Weaknesses:

      While the findings are of interest, they lack sufficient mechanistic detail in the current state to separate cross-sectional versus longitudinal hypertrophy. The authors have excellent tools such as the RBM20 model to functionally dissect mTOR signalling to these processes. It is also unclear if this process is unique to the diaphragm or is conserved across other muscle groups during eccentric contractions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      The study presents important insights into the regulation of muscle hypertrophy, regulated by Muscle Ankyrin Repeat Proteins (MARPs) and mTOR. The methods are overall solid and complementary, with only minor limitations. Overall, the findings will be of interest for both muscle-biology specialists and the broader mechanobiology community.

      We thank the editors for their interest in our manuscript. Below we respond to the reviewers’ comments. Based on these comments we made extensive textual revisions throughout the manuscript, and we added additional analyses to the revised results.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors employ diaphragm denervation in rats and mice to study titin‑based mechanosensing and longitudinal muscle hypertrophy. By integrating bulk RNA‑seq, proteomics, and phosphoproteomics, they map the stretch‑responsive signalling landscape, uncovering robust induction of the muscle‑ankyrin‑repeat proteins (MARP1‑3) together with enhanced phosphorylation of titin's N2A element. Genetic ablation of MARPs in mice amplifies longitudinal fibre growth and is accompanied by activation of the mTOR pathway, whereas systemic rapamycin treatment suppresses the hypertrophic response, highlighting mTORC1 as a key downstream effector of titin/MARP signalling.

      Strengths:

      The authors address a clear biological question: "how titin‑associated factors translate mechanical stretch into longitudinal fibre growth" using a unique and clinically relevant animal model of diaphragm denervation. Using a comprehensive multiomics approach, the authors identify MARPs as potential mediators of these effects and use a genetic mouse model to provide compelling evidence supporting causality. Additionally, connecting these findings to rapamycin, a drug widely used clinically, further increases the relevance and potential impact of the study.

      We thank the reviewer for their kind words and critical review of our manuscript. The roles of the MARP proteins are diverse and form an intriguing target for further study.

      Weaknesses:

      There are several areas where the manuscript could be substantially improved.

      (1) The statistical analysis of multi-omics data needs clarification. Typically, analyses across multiple experimental groups require controlling the false discovery rate (FDR) simultaneously to avoid reporting false-positive findings. It would be very helpful if the authors could specify whether adjusted p-values were calculated using a multi-factorial statistical model (e.g., ~group) or through separate pairwise contrasts.

      We agree with the reviewer that the description of the statistical analysis could be improved. We report q-values in the supplemental data tables to control for false positives; the p-values reflect pairwise comparisons. Statistical testing was performed on whole proteomes or phosphoproteomes, making for very stringent testing (please also see our reply to Reviewer 2, response 5). Unbiased quantitative proteomics functions primarily as a screen: in-solution digestion of muscle proteins yields comparatively few peptides, which makes population-adjusted p-value calculation very stringent and suggests few or no differences in expression. Hence, we compared RNA-seq to proteome data to isolate consistently differential proteins. We have revised the Methods section (lines 745-746) to clarify the FDR analysis.

      (2) (A) There are three separate points regarding MARP3 that could be improved. First, the authors report that MARP3-KO mice exhibit smaller increases in muscle mass after diaphragm denervation compared to wild-type mice (a -13% difference), indicating MARP3 likely promotes rather than attenuates hypertrophy. However, the manuscript currently states the opposite (lines 215-216); this interpretation should be revisited. (B) Second, it would be valuable if the authors could provide data showing whether MARP3 transcript or protein levels change in response to denervation - if they do not, discussing mechanisms behind the observed phenotype would help clarify the findings. (C) Finally, given that some MARP-KO mice already exhibit baseline differences, employing and reporting the full two-way ANOVA (including genotype × treatment interaction) would allow a direct statistical assessment of whether MARP deficiency modifies the muscle's response to stretch. This analysis would help clearly resolve any existing ambiguity.

      (A) Compared to wildtype mice, MARP3 KO mice exhibit baseline diaphragm hypertrophy. This suggests that MARP3 may normally restrain hypertrophy under basal conditions. However, in response to UDD, MARP3 KO mice display an attenuated hypertrophic response, which could be interpreted as MARP3 promoting hypertrophy under stress conditions, as noted by the reviewer. The relationship between MARP3 and metabolism remains incompletely understood, but prior studies indicate that loss of MARP3 enhances glucose tolerance and insulin sensitivity (PMID: 12456686), suggesting that MARP3 may act as a negative regulator of metabolic signaling. Both glucose and insulin can activate the PI3K pathway to promote hypertrophy (PMID: 16679293), which may contribute to the baseline hypertrophy observed in MARP3 KO diaphragms. In addition, MARP3 deficiency has been associated with activation of AMPK signaling (PMID: 26398569). AMPK is a key regulator of metabolic pathways and a well-established inhibitor of hypertrophic signaling, in part through suppression of mTOR activity, and is also responsive to mechanical stimuli (PMID: 18556591). Thus, increased AMPK activity in MARP3 KO mice may limit hypertrophy in response to UDD. Supporting this, our phospho-proteomics data indicate increased activation of the AMPK β-subunit following UDD, suggesting a potential role for AMPK signaling in stretch-induced hypertrophy. Based on these considerations, we have removed the statement that MARP3 attenuates hypertrophy and instead incorporated the potential role of AMPK signaling into the Discussion (lines 354–355). While the present study focuses on the triple MARP KO model, future work will examine the specific contributions of individual MARP proteins to muscle hypertrophy.

      (B) MARP3 (Ankrd23) upregulation at the RNA level was detected by RNA-seq in rat diaphragm following both UDD and BDD (Supplemental Tables 1 and 2). This is consistent with our prior findings in mice, where western blot analysis showed increased MARP3 protein expression following UDD (PMID: 29978560). We note that reliable detection of MARP3 protein remains technically challenging due to limited availability of specific antibodies.

      (C) We agree with the reviewer and have added the results of the two-way ANOVA to the figures (see updated Figure 4). The three MARP proteins exhibit differential effects on diaphragm hypertrophy, supporting their role as modulators of stretch-induced hypertrophy.

      (3) The current presentation of multi-omics data is somewhat difficult to follow, making it challenging to determine whether observed changes occur at the transcript or protein level due to inconsistent gene/protein naming and capitalization (e.g., proper forms are mTOR, p70 S6K, 4E-BP1). Clearly organizing and presenting transcript and protein-level changes side-by-side, especially for key molecules discussed in later experiments, would make the data more accessible and provide clearer insights into the biology of titin-mediated mechanosensing.

      We agree with the reviewer that naming conventions between gene and protein can be hard to follow. We kept the names for titin-associated proteins, as some have multiple protein names and the most common name is shown here. However, we made the suggested changes for the mTOR-related proteins (for example, see figure 5).

      (4) The current analysis relies on total protein measurements downstream of mTOR, yet mTOR's primary mode of action is to change phosphorylation status. Because the authors have already generated a phosphoproteomic dataset, it would be very helpful to report - or at least comment on - whether known mTOR target phosphosites were detected and how they respond to denervation and rapamycin. Including even a brief summary of canonical sites such as S6K1 Thr389 or 4E-BP1 Thr37/46 would make the link between mTOR activity and hypertrophy much clearer.

      We agree with the reviewer that the mTOR data requires more work to ascertain its function in regulating hypertrophy following UDD. We investigated S6K1 Thr389 and 4E-BP1 Thr37/46 in both the phosphoproteomic dataset and by western blot. These sites do not appear in the phosphoproteome mass spectrometry data (supplemental data table 13), and 4E-BP1 Thr37/46 was unchanged by western blot (not shown). The S6K1 Thr389 antibody was non-specific in our hands, but Norrby et al (PMID: 22657251) saw increased levels by 6 days of UDD. Hence the mTOR aspect of this study is quite complex: the data suggest that mTOR plays a major role in UDD hypertrophy, but potentially through an activation pathway other than the one classically described for muscle hypertrophy. We are investigating the mTOR mechanism further, focusing on mTOR’s role in regulating longitudinal hypertrophy with a potential connection to titin signaling, and hope to publish this in the next few years. We revised the discussion to include canonical mTOR activation in hypertrophy, please see lines 388-392.

      (5) Finally, since rapamycin blocks only a subset of mTOR signalling, a brief discussion that distinguishes rapamycin‑sensitive from rapamycin‑insensitive pathways would be valuable. Clarifying whether diaphragm stretch relies exclusively on the sensitive branch or also engages the resistant branch would place the results in a broader mTOR context and deepen the mechanistic narrative.

      We agree with the reviewer that distinguishing between rapamycin-sensitive and -insensitive mTOR signaling adds useful context to the interpretation of stretch-induced hypertrophy. Rapamycin primarily inhibits mTORC1, whereas mTORC2 is generally considered rapamycin-insensitive, although prolonged or high-dose exposure can also affect mTORC2 activity. Our data indicate that UDD induces a form of hypertrophy that is sensitive to rapamycin, supporting a prominent role for mTORC1 in this process. However, we cannot exclude the possibility that rapamycin-insensitive pathways, including mTORC2 signaling, also contribute. Notably, denervation itself may influence mTORC2 activity, which could complicate the distinction between stretch- and denervation-mediated signaling. Given these considerations, we have added a brief discussion to acknowledge potential contributions of rapamycin-insensitive mTOR signaling (lines 379-384). A more comprehensive dissection of mTORC1 versus mTORC2 signaling in this context will require targeted approaches and falls beyond the scope of the present study.

      Reviewer #1 (Recommendations for the authors):

      Minor comments:

      (6) The manuscript notes that KEGG analysis "confirmed" the GO‑term findings. Because KEGG pathways and GO terms describe different types of biological information, it might be clearer simply to present them as complementary lines of evidence rather than one validating the other.

      We agree and modified the text accordingly. “Concurrently, KEGG PATHWAY database searches (Supplemental data Table 6) indicated that the DEG’s are involved in muscle remodeling.” See lines 166-169.

      (7) Figure 2's legend mentions a two‑way ANOVA, but the specific factors tested are not specified. Listing those two factors would help readers interpret the statistics more easily.

      The two-way ANOVA refers to the violin plot in figure 2E and tests the difference of the 2 surgical modalities sham vs UDD and sham vs BDD. Sham groups were combined in the graphs for easy comparison. We clarified the text of figure legend 2.

      (8) The Methods briefly describe phosphopeptide enrichment, but additional details on the criteria for site identification - such as the localisation algorithm, probability cut‑off, and FDR thresholds - would make the phosphoproteomics section more transparent and reproducible.

      Please see the updated method section, lines 756-765.

      Reviewer #2 (Public review):

      Summary:

      Muscle hypertrophy is a major regulator of human health and performance. Here, van der Pijl and colleagues assess the role of the giant elastic protein, titin, in regulating the longitudinal hypertrophy of diaphragm muscles following denervation. Interestingly, the authors find an early hypertrophic response, with 30% new serial sarcomeres added within 6 days, followed by subsequent muscle atrophy. Using RBM20 mutant mice, which express a more compliant titin, the authors discovered that this longitudinal hypertrophy is mediated via titin mechanosensing. Through an omics approach, it is suggested that the muscle ankyrin repeat proteins may regulate this process. Genetic ablation of MARPs 1-3 blocks the hypertrophic response, although single knockouts are more variable, suggesting extensive complementation between these titin binding proteins. Finally, it is found through the administration of rapamycin that the mTOR signalling pathway plays a role in longitudinal hypertrophic growth.

      Strengths:

      This paper is well written and uses an impressive suite of genetic mouse models to address this interesting question of what drives longitudinal muscle growth.

      We appreciate the reviewer’s kind words on our manuscript and their critical review of our work. A potential separate mechanism governing cross-sectional versus longitudinal hypertrophy is of great interest and something we aim to address in future manuscripts.

      Weaknesses:

      While the findings are of interest, they lack sufficient mechanistic detail in the current state to separate cross-sectional versus longitudinal hypertrophy. The authors have excellent tools such as the RBM20 model to functionally dissect mTOR signalling to these processes. It is also unclear if this process is unique to the diaphragm or is conserved across other muscle groups during eccentric contractions.

      Reviewer #2 (Recommendations for the authors):

      (1) Cross-sectional hypertrophy characterization: The paper emphasizes longitudinal hypertrophy but does not quantify the contribution of radial (cross-sectional) hypertrophy to the total mass increase. Given that the denervated costal diaphragm shows ~50% increase in mass (Figure 1B) but there is only ~30% fiber lengthening, it is important to determine the proportion attributable to fiber diameter changes. Histological analysis of muscle fiber cross-sectional area would clarify the relative contributions of longitudinal versus radial hypertrophy to the overall mass phenotype.

      We agree with the reviewer that radial hypertrophy is an important mechanism for muscle weight gain in UDD. In previous work we characterized both the radial and longitudinal hypertrophy response in 6-day UDD and found that ~20% of the mass gain seen in UDD is radial hypertrophy (PMID: 29978560). We reference this paper in the discussion section, lines 277-278. Doing a full histological work-up of UDD diaphragm would be interesting but falls outside the scope of this manuscript. Our focus was to characterize longitudinal hypertrophy by addition of sarcomeres in series and provide insight into titin’s role in regulating longitudinal hypertrophy. We hope that the reviewer agrees with this approach.

      (2) Titin isoform expression analysis: At line 103, the authors propose that longitudinal hypertrophy reduces strain on titin by decreasing fractional sarcomere extension. However, this hypothesis does not exclude the possibility of isoform switching to a less elastic titin variant, which may compensate for changes in mechanical stress. The RNA-sequencing data should be analyzed for titin exon usage patterns between sham and UDD to determine whether changes in isoform composition (e.g., PEVK region splicing) accompany longitudinal hypertrophy. If isoform switching occurs, this represents an alternative or complementary mechanism to sarcomere addition.

      We analyzed titin exon usage in rat following both UDD and BDD. Increases in sarcomeres in series associated with UDD show modest changes in titin exon usage, though not significant by population adjusted p-values. The denervation effect of BDD did show changes in splicing, indicating lower inclusion of PEVK encoding exons, suggesting a stiffening of the titin molecules. Stiffening of titin molecules might be protective for the fully paralyzed diaphragm and preserve muscle mass. This would align with our prior publication (PMID: 29978560) which showed that stiffer titin generated more radial hypertrophy in response to UDD. In response to the reviewer’s comment, we added the splicing data to the supplemental data as new figure 2 and briefly address titin splicing in the results section, see lines 121-125.

      (3) The comparison of 3-day unilateral diaphragm denervation (UDD) and bilateral diaphragm denervation (BDD) in rats (Figure 1D-E) is used to argue that hypertrophic signaling is stretch-dependent rather than denervation-dependent. However, this interpretation requires clarification. In mice, hypertrophy is detectable as early as 1 day post-UDD, whereas the 3-day BDD protocol may drive an accelerated hypertrophic-to-atrophic remodelling process given the severity of the model. Moreover, longitudinal and global muscle hypertrophy may operate through distinct mechanisms: denervation could suppress longitudinal hypertrophy through a separate pathway while promoting or delaying cross-sectional hypertrophy. The authors should acknowledge that the current evidence does not fully exclude denervation-dependent mechanisms and should consider extended BDD time points or additional mechanistic studies to clarify this distinction.

      UDD and BDD are both denervation models, and hypertrophy occurs in the denervated costal diaphragm of UDD-operated animals. Stretch is the mechanical difference between UDD and BDD and is thus the trigger for hypertrophy signaling. Denervation signaling should in principle be comparable in the two models and is unlikely to play a different role in UDD than in BDD, except that UDD also induces a more potent hypertrophy signaling profile on top of the atrophy program. That said, BDD is a more severe model: respiration rate is depressed in BDD, whereas it is elevated in UDD. BDD rats also engage in abdominal breathing, which mildly stretches the diaphragm. Hypoxia is likely to play a stronger role in BDD than UDD and could thus further enhance the atrophy profile of BDD. We agree with the reviewer that more work is needed to elucidate the BDD remodeling response; however, UDD-induced stretch is the main driver of longitudinal hypertrophy. In response to the reviewer’s comment, we have added clarifying text to the discussion, lines 286-292.

      The potential for there being two independent mechanisms for both radial and longitudinal hypertrophy is of great interest to us. We foresee that dissecting out these differences will require a cell culture-based approach and will aid in avoiding the complexity of overlapping denervation and hypertrophy signals as seen in this manuscript.

      (4) Characterization of RBM20 models: The RBM20 experiments rely on the assumption that increased titin compliance reduces stretch sensitivity. However, the paper provides minimal baseline characterization of the diaphragms. Specifically: (a) What are the sarcomere lengths in RBM20-deficient diaphragms at rest and under stretch? (b) How does the passive force-length relationship differ between wildtype and RBM20-deficient diaphragm muscles? and (c) Would RBM20-deficient muscles, despite having longer sarcomeres at baseline, actually experience sufficient strain to activate mechanosensing? These data are necessary to interpret why RBM20-deficient mice show attenuated mass gain rather than none (as in BDD) during UDD (Supplemental Figure 2A-C). Additionally, what would the authors hypothesize would happen if rapamycin were used in RBM20 UDD models? It appears to be an attractive experimental approach to separate potential mTOR contributions to longitudinal versus cross-sectional hypertrophy.

      We agree with the reviewer that more work is needed on Rbm20 deficient mice and rats to elucidate their response to stretch. Part of this characterization has previously been published (PMID: 29978560) and Rbm20 splice-deficient mice have reduced passive stiffness in the diaphragm and show a robust mechanosensing response to UDD. Rbm20 splice-deficient mice also show a similar increase in longitudinal hypertrophy, but a blunted radial hypertrophy in response to 6-days UDD. The main reason for not expanding on these mice/rats further was the added complexity of Rbm20 splicing multiple targets that could affect hypertrophy signaling, for example LDB3 (ZASP) and FLNC (Filamin C) are both associated with hypertrophic cardiomyopathy. Hence for the purpose of this manuscript we showed mice and rats having a similar response to UDD, hypertrophy wise, and that titin stiffness (reduced in Rbm20-deficient animals) affects hypertrophy at the diaphragm mass level.

      Testing rapamycin on Rbm20-deficient animals could be interesting; however, the complexities of also changing splicing of non-titin targets will make interpretation of mTOR signaling difficult. Perhaps an alternative approach would be to generate a titin mouse model with more compliant titin (e.g. increase the size of the PEVK segment), a model we are considering for future studies. TtnΔ112-158 mice, in which a large portion of the PEVK region is deleted (PMID: 30565562), show increases in sarcomere number. We would thus expect a model with more PEVK to show a reduction in the number of sarcomeres in series. We discuss the role of titin stiffness in the discussion and how titin stiffness ties to longitudinal hypertrophy, please see lines 302-314.

      (5) Statistical analysis and multiple hypothesis correction: The proteomic analyses appear to employ a nominal p-value threshold (p < 0.05) without correction for multiple comparisons or false discovery rate (FDR) control. This is particularly concerning given the large number of comparisons. For example, the authors report 142 titin phosphorylation sites significantly different between sham and UDD at p < 0.05 (approximately 20% of ~700 identified sites). However, with proper FDR correction (adjusted p < 0.05), only 14 sites remain significant - a 90% reduction. This discrepancy is critical for the discussion on titin N2A phosphorylation sites pS9459 and pS9520, where only pS9520 achieves statistical significance after FDR adjustment. The authors should justify their choice of statistical thresholds and reanalyze key findings using FDR-corrected p-values. Additionally, the phosphoproteomics dataset should be screened for duplicate phosphosite identifications to ensure each site is counted only once.

      Reviewer 1 has voiced similar concerns, and we have thus expanded the methodology to explain the statistical tests used to analyze the data and the process of establishing Z-scores of isobaric peptides for the same phospho-sites (see lines 756-765). Our statistical analysis covers all detected peptides. When we analyze only the titin peptides, pS9459 is significant only by t-test, likely due to large variation in isobaric peptides, whereas pS9520 is significant by both independent t-test and FDR. We changed figure 3D to show the fold change instead of the previous Z-score for more intuitive interpretation.
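
      The gap between nominal and FDR-adjusted significance discussed in this exchange can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure. This is an assumption for illustration only: the thread does not state which FDR method the authors used, the function name is hypothetical, and the p-values below are invented and are not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, invented for illustration (not from the manuscript).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
q = benjamini_hochberg(p)

nominal_hits = sum(x < 0.05 for x in p)  # significant before correction
fdr_hits = sum(x < 0.05 for x in q)      # significant after correction
print(nominal_hits, fdr_hits)
```

      In this toy set, several tests that pass the nominal 0.05 threshold no longer pass once adjusted, which is the same kind of reduction the reviewer describes for the titin phosphosites.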

      Minor comments:

      (6) Line 52: "thesarcomeres" should read "the sarcomeres".

      A space has been added, please see line 52.

      (7) Line 52: "half-sarcomer" should read "half-sarcomere"

      Spelling has been corrected, please see line 52.

      (8) Figure clarity: Figure 1 (B-C) presents mouse data, while Figure 1 (D-E) presents rat data. This distinction should be clearly labeled in the figure legend or on the axes to prevent misinterpretation, particularly for readers unfamiliar with the experimental design.

      We added the species to the y-axis of revised figure 1B-E and added additional clarification in the figure legend.

      (9) Supplementary tables: When reporting statistical comparisons in the supplementary tables, please consider including the directionality of the statistical tests (e.g., which group was higher or lower) alongside p-values. This will facilitate interpretation without requiring reference to the main text figures.

      We agree with the reviewer and added statistical direction as a new column next to the p-values, please see the revised supplemental tables.

      (10) Given the interesting divergent findings in MARPtKO versus single knockouts, it would be interesting to assess by immunofluorescence the association of each MARP with the N2A region of titin following UDD.

      We agree with the reviewer that localization is important. Miller et al (PMID: 14583192) previously localized MARP1-3 to the N2A segment by immuno-EM and our work previously localized MARP1 to N2A using SR-SIM (PMID: 29978560). We will further investigate MARPs binding to the N2A region in an upcoming study that we intend to publish soon.

    1. eLife Assessment

      This potentially valuable study investigates the anti-senescence effects of red light exposure, proposing that reduced SIRT4 levels enhance fatty acid metabolism and H3K9ac, thereby attenuating ageing-related phenotypes. The authors use multiple approaches, including cultured cells, animal models, and molecular analyses, to support their conclusions. However, the evidence remains incomplete, as additional controls and stronger mechanistic data are needed to fully support the proposed pathway, particularly how red light exposure reduces SIRT4 levels.

    2. Reviewer #1 (Public review):

      Summary:

      Deng and colleagues pursue the possibility that red light exposure can provide some benefits and anti-senescence effects in aged mouse models. In addition, they show how red light influences metabolism in cultured keratinocytes. The authors provide a long dissection of the potential paths involved in the changes promoted by red light exposure, identifying CytC oxidase, SIRT4, PPARa and MCD as key players.

      Strengths:

      The authors did a thorough exploration of the multiple potential avenues by which red light exposure influences metabolism. The in vitro and in vivo evidence nicely complement each other.

      Weaknesses:

      This is a challenging hypothesis that would require some additional experimental controls. The pathway dissection, while extensive, is sometimes approached in unconvincing ways, and the results are not always evident to judge or interpret. Technically, the western blots and transcriptomic analyses require notable improvements.

    3. Reviewer #2 (Public review):

      Summary:

      This work identifies a previously unknown way that red light can slow ageing. The authors show that red light lowers the level of a protein called SIRT4 in skin cells. Reducing SIRT4 boosts fatty acid use and increases a type of histone modification that keeps genes active. These changes help cells clear away signs of ageing, reduce inflammation, and restore normal metabolism. The findings open the possibility of developing new treatments that target SIRT4 to reverse age‑related decline.

      Strengths:

      The evidence is solid because the authors use several complementary methods. They test red light in both cultured cells and naturally aged mice, and they confirm the key role of SIRT4 by silencing its gene. Measurements of metabolism, protein changes, and ageing markers all point in the same direction. However, the exact way red light lowers SIRT4 levels is not fully explained, which leaves a minor gap. Overall, the conclusions are well supported and convincing.

      Weaknesses:

      The paper does not go on to use its mechanistic discoveries to help our community identify the mechanism of photobiomodulation, which remains unknown.

      I would like to draw attention to a recently published paper by Herrera et al. (FEBS Letters 2025, doi:10.1002/1873-3468.70195), which shows that red light (660 nm) stimulates mitochondrial fatty acid oxidation in keratinocytes via AMPK‑dependent phosphorylation of ACC, without altering expression of electron transport chain complexes. I believe this paper is highly complementary to the current study.

      Herrera et al. demonstrate that red light increases basal, ATP‑linked, and maximal oxygen consumption rates in keratinocytes specifically through enhanced fatty acid oxidation (inhibited by etomoxir). This independently validates the central finding of the current manuscript, i.e., red light boosts lipid metabolism, strengthening the robustness of this concept.

      While the current manuscript focuses on the SIRT4‑MCD axis, Herrera et al. identify AMPK phosphorylation and ACC inhibition as key effectors. The authors can integrate and expand their discussion, since SIRT4 downregulation may converge on AMPK activation, or they may represent parallel, reinforcing mechanisms. This would enrich the mechanistic model and open new hypotheses.

      The mechanism of photobiomodulation: Herrera et al. explicitly challenge the prevailing paradigm that red light acts solely via cytochrome c oxidase (by showing long-lasting effects, unchanged OXPHOS protein levels, and no difference in permeabilised cells). The current finding (red light acts through SIRT4 downregulation, i.e., not direct enzymatic activation) aligns perfectly with Herrera et al.'s critique.

      Long-term metabolic effects: Herrera et al. show that a single red light exposure elevates oxygen consumption for up to 2 days. The current study focuses on changes at 12-24 h. Their data extend the time window and suggest that the metabolic reprogramming you describe may persist longer than currently discussed, which is clinically relevant.

      Discussing Herrera et al.'s results would not only acknowledge independent, corroborating evidence but would also allow the authors to position their SIRT4‑centric mechanism within a broader, emerging understanding of red‑light photobiomodulation.

    4. Author response:

      Reviewer #1 (Public review):

      Weaknesses:

      This is a challenging hypothesis that would require some additional experimental controls. The pathway dissection, while extensive, is sometimes approached in unconvincing ways, and the results are not always evident to judge or interpret. Technically, the western blots and transcriptomic analyses require notable improvements.

      We would like to thank the reviewer for the careful and patient examination of the issues identified in our manuscript. The poor quality of some of the Western blot bands in Figure 4 may have been caused by inappropriate electrophoresis conditions during the Western blot experiments. In the revised manuscript, we will optimize the electrophoresis conditions to obtain higher-quality protein bands and update the quantitative data. Regarding the quantification format, we believe that heatmaps provide a more intuitive representation of trends in protein expression across different treatment groups. This approach more accurately reflects the results of our biological replicates than simply analyzing the significance of differences in the grayscale values of protein bands. For the analysis of transcriptomic data, we will conduct a more detailed analysis of signal pathway enrichment and the identified differentially expressed genes to ensure that predicted genes are excluded from our current results and redundant data presentation is removed.

      Regarding additional experimental controls, such as including data obtained under blue light treatment as a control for red light: while exploring the optimal red light irradiation dose at the cellular level, we simultaneously tested the effects of blue light irradiation at the same dose on keratinocyte activity. As the blue light dose increased (0–160 J/cm²), keratinocyte activity declined in a dose-dependent manner, indicating that blue light is phototoxic to keratinocytes. These results have already been published in our previous study (Communications Biology 2024, doi: 10.1038/s42003-024-06973-1). Taken together with the data from the present study, this demonstrates that the anti-aging effects reported in the current manuscript are indeed driven by red light.

      Reviewer #2 (Public review):

      Weaknesses:

      The paper does not go on to use its mechanistic discoveries to help our community identify the mechanism of photobiomodulation, which remains unknown.

      I would like to draw attention to a recently published paper by Herrera et al. (FEBS Letters 2025, doi:10.1002/1873-3468.70195), which shows that red light (660 nm) stimulates mitochondrial fatty acid oxidation in keratinocytes via AMPK‑dependent phosphorylation of ACC, without altering expression of electron transport chain complexes. I believe this paper is highly complementary to the current study.

      Herrera et al. demonstrate that red light increases basal, ATP-linked, and maximal oxygen consumption rates in keratinocytes specifically through enhanced fatty acid oxidation (inhibited by etomoxir). This independently validates the central finding of the current manuscript, i.e., red light boosts lipid metabolism, strengthening the robustness of this concept.

      While the current manuscript focuses on the SIRT4-MCD axis, Herrera et al. identify AMPK phosphorylation and ACC inhibition as key effectors. The authors can integrate and expand their discussion, since SIRT4 downregulation may converge on AMPK activation, or they may represent parallel, reinforcing mechanisms. This would enrich the mechanistic model and open new hypotheses.

      The mechanism of photobiomodulation: Herrera et al. explicitly challenge the prevailing paradigm that red light acts solely via cytochrome c oxidase (by showing long-lasting effects, unchanged OXPHOS protein levels, and no difference in permeabilised cells). The current finding (red light acts through SIRT4 downregulation, i.e., not direct enzymatic activation) aligns perfectly with Herrera et al.'s critique.

      Long-term metabolic effects: Herrera et al. show that a single red light exposure elevates oxygen consumption for up to 2 days. The current study focuses on changes at 12-24 h. Their data extend the time window and suggest that the metabolic reprogramming you describe may persist longer than currently discussed, which is clinically relevant.

      Discussing Herrera et al.'s results would not only acknowledge independent, corroborating evidence but would also allow the authors to position their SIRT4-centric mechanism within a broader, emerging understanding of red-light photobiomodulation.

      We would like to thank the reviewer for providing us with constructive suggestions for discussion. Our results showed that under red light conditions, both glycolipid and lipid metabolism were activated in keratinocytes, and cellular metabolic flux increased. The activation of lipid metabolism directly led to an increase in metabolism-associated H3K9ac and drove the upregulation of anti-aging-related genes; we believe this is key to the anti-aging effects of red light. Mechanistic analysis combining proteomics and acetylation proteomics revealed that red light significantly downregulated SIRT4 expression and increased the acetylation of MCD, a protein regulated by SIRT4 that governs cellular fatty acid oxidation rates. Through validation using cell-level knockdown and inhibitors, we confirmed that SIRT4 inhibition exerts anti-aging effects in vitro and that inhibiting MCD function under red light conditions suppresses H3K9ac. These results establish the role of the SIRT4-MCD signalling axis in mediating the anti-aging effects of red light.

      The study by Herrera et al. included a substantial body of validation data confirming the role of red light in promoting fatty acid oxidation, providing robust empirical support for our research. Herrera et al. also revealed that red light-induced fatty acid oxidation depends on AMPK and ACC phosphorylation. This mechanism of red-light photobiomodulation may refute the notion that its bio-regulatory effects rely solely on the action of mitochondrial cytochrome c oxidase. Together with our study, which reveals that red light exerts anti-aging photobiomodulatory effects via the SIRT4-MCD signalling axis, these findings independently confirm that red light regulates cellular fatty acid oxidation, thereby demonstrating the pivotal role of activated fatty acid oxidation in the bio-regulatory effects of red light. In the revised manuscript, we will include a discussion of the potential link between the red light-driven downregulation of SIRT4 and the phosphorylation of AMPK/ACC. This will help to elucidate how SIRT4 exerts its anti-aging effects by regulating lipid metabolism, and to explain the possible mechanisms by which red light downregulates SIRT4.

    1. eLife Assessment

      The study presents valuable findings regarding the impact of ARHGEF6 deletion, a RhoGTPase regulator linked to X-linked intellectual disability (XLID46), in the development of interneurons. The evidence supporting the observed cellular and developmental phenotypes collected in both mouse and human iPSC models is convincing, although further work would strengthen the mechanistic interpretation and clarify the specificity of the findings. This work offers new insights into ARHGEF6 function and the potential contribution of its dysfunction to neurodevelopmental disorders.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript has several strengths, including a technically comprehensive approach that combines mouse genetics, electrophysiology, live imaging in assembloids, and human organoid models, providing a rich and multifaceted dataset. Cross-species validation through the parallel use of mouse and human systems strengthens the generality of the observed phenotypes and increases relevance to human neurodevelopment.

      Consistent phenotypic observations across systems show that ARHGEF6 loss affects migration, neurite morphology, growth cone structure, and neuronal survival, supporting a coherent role in cytoskeletal regulation.

      There is clear evidence for developmental defects, including reduced interneuron numbers, increased apoptosis in the ganglionic eminences, and migration deficits, all well supported by quantitative analyses. Also, there is a high-quality electrophysiological characterization that demonstrates reduced firing in interneurons, providing a well-controlled functional phenotype.

      Strengths:

      The manuscript has several strengths, including a technically comprehensive approach that combines mouse genetics, electrophysiology, live imaging in assembloids, and human organoid models, providing a rich and multifaceted dataset. Cross-species validation through the parallel use of mouse and human systems strengthens the generality of the observed phenotypes and increases relevance to human neurodevelopment.

      Consistent phenotypic observations across systems show that ARHGEF6 loss affects migration, neurite morphology, growth cone structure, and neuronal survival, supporting a coherent role in cytoskeletal regulation.

      There is clear evidence for developmental defects, including reduced interneuron numbers, increased apoptosis in the ganglionic eminences, and migration deficits, all well supported by quantitative analyses. Also, there is a high-quality electrophysiological characterization that demonstrates reduced firing in interneurons, providing a well-controlled functional phenotype.

      Weaknesses:

      Despite the strengths mentioned above, the study has some conceptual and experimental weaknesses that reduce its impact. The mechanistic insight is limited, as the research does not directly establish how ARHGEF6 regulates downstream signaling pathways.

      Also, there is insufficient evidence for interneuron specificity; although the central claim is that ARHGEF6 plays a selective role in interneurons, the data do not adequately exclude the possibility that the observed effects reflect broader neuronal defects. The study lacks critical controls across cell types, as several phenotypes observed in organoids and progenitors, including apoptosis, reduced neuronal output, and altered morphology, could also affect multiple neuronal populations without being directly tested. Furthermore, the data are predominantly descriptive, with many results remaining correlative and failing to establish causal relationships.

      Some more comments:

      (1) Given that ARHGEF6 is a guanine nucleotide exchange factor for Rac1 and Cdc42, the absence of direct measurements of GTPase activity or downstream signaling represents a significant gap. The interpretation that the observed phenotypes are mediated through specific cytoskeletal pathways, therefore, remains inferential.

      (2) The manuscript repeatedly interprets the findings as interneuron-specific. However, several key observations are not demonstrated to be restricted to IN. Without direct comparison to excitatory neurons or other cell types, it is difficult to conclude that ARHGEF6 plays a selective role in interneurons rather than a more general role in neuronal development. The well-done analysis of the transcriptomic dataset is not sufficient to claim IN specificity. This issue is particularly important for the interpretation of the human organoid experiments, where reductions in SOX2⁺ progenitors and NEUN⁺ neurons, as well as increased apoptosis, could reflect global developmental defects. Similarly, in the mouse experiments, the reduction in GAD67⁺ cells is compelling, but it is not shown whether other neuronal populations are also affected.

      (3) The study provides a strong phenotypic description but limited causal resolution. For example, migration defects, altered growth cone morphology, and reduced branching are all consistent with impaired cytoskeletal regulation, but the links between these phenotypes are not directly established. Likewise, while the electrophysiological data convincingly show reduced firing in interneurons, the connection between altered cytoskeletal dynamics and intrinsic excitability is not explored.

      (4) Several aspects of data presentation could be improved. In multiple figures (e.g., Figure 1A, D; Figure 4 and Video S1, 2), the images are difficult to interpret due to high cellular density, limited magnification, or lack of clear annotation. In some cases, it is not fully clear how quantifications were performed or which regions were analyzed. Improving the visual clarity with arrows, boxes, and high-magnification inserts of the data would strengthen confidence in the conclusions.

    3. Reviewer #2 (Public review):

      The authors investigate the impact of the deletion of the small GTPase regulator ARHGEF6 on the development and physiology of interneurons. Using public databases, they first show that ARHGEF6 is enriched in interneurons or in areas that give rise to them, both in development and adulthood, in humans and mice. Using a previously reported complete KO mouse and a GAD67-GFP reporter mouse line, they show that in the adult mouse cortex and hippocampus, there is a notable reduction in GFP+ cells. These mice show increased apoptotic cells at different timepoints and areas of the brain during development. In the developing cortex of ARHGEF6-KO mice, there are fewer IN in all layers, and cells display incorrectly oriented processes. IN from the hippocampus in culture show reduced excitability and impaired neurite branching. The authors then established isogenic hiPSC lines to study ARHGEF6 deletion in human cells and differentiated ventral forebrain neurons, finding both interneuron-related and non-related phenotypes. Most importantly, human interneurons grown in organoids show reduced branching and altered growth cone morphology. The authors claim that the novel interneuron phenotypes found in these models can explain, in part, the human intellectual disabilities associated with mutations in this protein. The study is well conducted and opens new avenues of research not only for the role of small GTPase regulation in early nervous system development, but also for how interneuron deficiencies impact a wider range of intellectual disability syndromes found in humans.

      However, most conclusions of the present version would be strengthened after considering the following comments:

      Major comments

      (1) The reported biological processes evaluated at different developmental stages may be directly or indirectly related to ARHGEF6 function itself. As a model of a hereditary disease, full organism gene deletion is valid, since the human patients suffer from that condition as well. However, to investigate the roles of a protein, complete deletions may not be very accurate since they can give rise to phenotypes that are only indirectly related to the protein function itself. Most conclusions of the present manuscript should either be discussed in this regard or be supported by evidence for a direct role of the protein. Such evidence is typically obtained with acute knockdowns in culture, or in developing brains by in utero electroporation. For example, Figure 1C shows that the principal excitatory neurons in the hippocampus do not express ARHGEF6. However, most electrophysiological and behavioral evidence of defects in ARHGEF6-KO mice arises from evaluating these cells (Ramakers et al., 2012). I am not suggesting that either the previous or the current evidence is wrong. But I believe readers would benefit from a clear distinction (or added cautionary notes) between a functional consequence of the deletion (that can be months away and in other cells than the actual molecular defect) and a true cell biological function of the protein under study. In favor of the authors, this is a concern with most conclusions derived from KO organisms.

      (2) Figure 1E-G, H, I. All conclusions are made with a GAD67-GFP reporter, which is a very powerful and reliable tool for large-scale screening. All the conclusions of the paper would be strengthened if immunohistochemical staining for specific interneuron markers in the same areas were added as complementary evidence.

      (3) Cell death in development: It is surprising that the high amount of TUNEL staining during development does not translate into gross histological changes in the adult brain (studied elsewhere). Can authors discuss possible explanations?

      (4) Section 4 (Figures 2F-J) - The authors present this staining as an analysis of migration. Normally, migration studies are performed with a "pulse-chase" paradigm, where a single cohort is labeled and then followed over time (normally by in utero electroporation of a fluorescent protein). Tissue is then fixed at different time points, and migration can be followed. On the contrary, the evidence is from a single point, in an experimental setting in which all Gad67 IN are stained, and hence, one cannot imply a defect in migration. The differences between WT and ARHGEF6-KO are obvious and interesting; it is just that they cannot be solely attributed to a problem in migration.

      Also, a true migration phenotype in the current setting should have shown that the cells that failed to migrate accumulate in deeper layers. My impression is that the changes in IN per layer are more easily explained by total cell number than by migration. Perhaps evaluating earlier timepoints could clarify this.

      (5) It is known that ARHGEF6 deletion produces severe F-actin phenotypes in neurons. Have the authors confirmed that the GAD67 cells in their hippocampal cultures ALSO have these phenotypes (stress fibers in somas, growth cones, and actin patches along neurites)?

      (6) Section 4. The authors present data for deficient migration of the GFP-labeled interneurons. Is it possible to assess, in the same sections, whether other cell types are also affected? Although the hypothesis that ARHGEF6 deletion will have an impact in IN is well rooted in expression data, by assessing other cell types, one can even include a positive control or evidence for a cell-autonomous phenotype.

      (7) ARHGEF6 deletion has an important impact on organoid development (size, shape, etc). Have the authors analysed whether these organoids produced fewer interneurons?

      (8) In assembloids, the differences in migration parameters are very small between WT and ARHGEF6-KO, which reinforces that perhaps what is observed in the different layers of cortex during mouse development is likely not entirely due to migration, as concluded.

      (9) To properly weigh the present evidence (interneuron deficits) obtained with the ARHGEF6-KO model, the authors should include a deeper discussion in light of the extensive work that has been done using these mice. How does the finding of a diminished IN population in the brain of these mice explain the large amount of electrophysiological and behavioral evidence produced before with these animals? Perhaps the most important work to discuss these aspects is the initial ARHGEF6-KO report by Ramakers and colleagues (2012), but there are others.

      Minor comments

      (1) Figure 1A. It looks clear that the GE shows the highest expression of ARHGEF6; however, the reader needs the reference levels where the log2 expression is calculated. What are the reference levels?

      (2) Have the authors compared the number of GAD67-eGFP cells in the hippocampal cultures between WT and ARHGEF6-KO mice?

      (3) Section 3, as a caution note, authors should mention that it is not possible to know from the evidence provided which cells are dying.

      (4) In the dorsal-ventral assembloids, it is expected that the ventral organoid would contain lots of GFP expression compared to the dorsal, but in the image shown (Figure 5A) both parts of the assembloid seem to have the same amount and distribution of GFP. How is that possible?

    4. Reviewer #3 (Public review):

      Summary:

      ARHGEF6 is a RAC1/CDC42 guanine nucleotide exchange factor that has been proposed to be associated with X-linked intellectual disability, but its relevance to the pathology is not well established. ARHGEF6 has been assigned a role in spine density and plasticity of hippocampal pyramidal neurons, but nothing is known about its role in interneuron development. Here, the authors show that ARHGEF6 is expressed early in development in the inhibitory lineage during the peak of interneuron generation and migration. The aim of the study is therefore to investigate whether, in addition to its role in pyramidal neurons, ARHGEF6 could play a role in inhibitory neuron development. Using both ARHGEF6-KO mice and organoids from ARHGEF6-KO hiPSCs, the authors show that ARHGEF6 plays a critical role in interneuron development and function.

      Strengths:

      The major strength of the paper is the very detailed analysis of the role of ARHGEF6 using two different systems: ARHGEF6-KO mice and deletion of ARHGEF6 in human iPSC-derived organoids. Strikingly, deletion of ARHGEF6 in both systems induces similar defects such as an increase in apoptosis, reduced neuronal output, impaired neuronal morphology, and disrupted migratory dynamics. This compelling evidence demonstrates that ARHGEF6, in addition to its already well-described role in spine formation and plasticity, is playing a crucial role during embryonic development through its function in interneurons.

      Weaknesses:

      (1) In Figure 1, the authors show that ARHGEF6 is expressed in different regions of the brain, including the interneuron lineage, and that depletion of ARHGEF6 reduces the number of GABAergic neurons in the adult cortex and hippocampus. To better characterize this defect, the authors investigate in Figure 2 whether deletion of ARHGEF6 affects interneuron migration and survival during embryonic development. To do so, ARHGEF6-KO mice were crossed with the GAD67-eGFP reporter line to follow the inhibitory lineage. The authors analyse apoptosis using TUNEL staining, and show that it is significantly increased in the ganglionic eminence of ARHGEF6-KO E14.5 embryos. The authors claim that this is not the case in the cortex. However, the image shown in Figure 2A really suggests that staining is increased. Which part of the neocortex is analysed for quantification? This should be clarified.

      (2) In Figure 2F-J, the authors investigate the migration of interneurons by analysing the GAD67-eGFP staining, and clearly show that the migratory abilities of the depleted neurons are reduced. However, the authors do not discuss the fact that, because depletion of ARHGEF6 increases apoptosis, there are fewer neurons available for migration. This is important for the interpretation of the data. This point should be clarified.

      (3) In Supplementary Figure S2, the authors describe the establishment of the ARHGEF6-KO human iPSC line and test the ability of these cells to undergo correct development, especially for the generation of neural progenitor cells. I was wondering why the authors do not present the data of both control and ARHGEF6-KO cells.

      (4) At the molecular level, how ARHGEF6 depletion could affect neuronal survival is missing. In addition, as ARHGEF6 is a GEF for RAC1 and Cdc42 amongst other GEFs, I would have expected that the authors test how RAC1 activity (and Cdc42) is affected in ARHGEF6-depleted brains and in ARHGEF6-KO organoids. The measure of phalloidin staining and the anisotropy index are not really meaningful.

      (5) The authors show that ARHGEF6-KO forebrain organoids were markedly smaller compared to their isogenic controls, and their study suggests that ARHGEF6 expression impacts progenitor maintenance and neurogenesis. Although interneurons represent only a minority of the total neuronal population, I was wondering whether ARHGEF6-KO mice present brain morphology defects such as microcephaly.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript has several strengths, including a technically comprehensive approach that combines mouse genetics, electrophysiology, live imaging in assembloids, and human organoid models, providing a rich and multifaceted dataset. Cross-species validation through the parallel use of mouse and human systems strengthens the generality of the observed phenotypes and increases relevance to human neurodevelopment.

      Consistent phenotypic observations across systems show that ARHGEF6 loss affects migration, neurite morphology, growth cone structure, and neuronal survival, supporting a coherent role in cytoskeletal regulation.

      There is clear evidence for developmental defects, including reduced interneuron numbers, increased apoptosis in the ganglionic eminences, and migration deficits, all well supported by quantitative analyses. Also, there is a high-quality electrophysiological characterization that demonstrates reduced firing in interneurons, providing a well-controlled functional phenotype.

      Strengths:

      The manuscript has several strengths, including a technically comprehensive approach that combines mouse genetics, electrophysiology, live imaging in assembloids, and human organoid models, providing a rich and multifaceted dataset. Cross-species validation through the parallel use of mouse and human systems strengthens the generality of the observed phenotypes and increases relevance to human neurodevelopment.

      Consistent phenotypic observations across systems show that ARHGEF6 loss affects migration, neurite morphology, growth cone structure, and neuronal survival, supporting a coherent role in cytoskeletal regulation.

      There is clear evidence for developmental defects, including reduced interneuron numbers, increased apoptosis in the ganglionic eminences, and migration deficits, all well supported by quantitative analyses. Also, there is a high-quality electrophysiological characterization that demonstrates reduced firing in interneurons, providing a well-controlled functional phenotype.

      We thank the reviewer for their positive and thoughtful assessment of our manuscript. We appreciate their recognition of the technical breadth of the study, including the integration of mouse genetics, electrophysiology, live imaging in assembloids, and human organoid models. We are also grateful that the reviewer highlights the value of our cross-species approach, as a major goal of the study was to determine whether ARHGEF6 loss produces convergent developmental and cellular phenotypes in both mouse and human systems.

      Weaknesses:

      Despite the strengths mentioned above, the study has some conceptual and experimental weaknesses that reduce its impact. The mechanistic insight is limited, as the research does not directly establish how ARHGEF6 regulates downstream signaling pathways.

      We appreciate the reviewer’s constructive comment. We agree that, although our data establish a phenotypic link between ARHGEF6 loss and interneuron development, they do not directly dissect the molecular mechanisms underlying the observed defects. Our interpretation that the mutant phenotype involves dysregulation of cytoskeletal dynamics is based on the directly observed defects in actin polymerization and organization in neural progenitor cells and neuronal growth cones respectively, and is consistent with the abnormalities observed in neurite morphology and neuronal migration. This interpretation is further supported by the established role of Arhgef6 as a regulator of the small Rho GTPases Rac1 and Cdc42. Previous evidence shows that Arhgef6 loss reduces the activity of both GTPases and deregulates the expression of the cytoskeletal regulators Pak1–3, Limk1, and Cofilin in the mouse brain (Ramakers et al., 2012). Moreover, spine abnormalities in Arhgef6-knockdown ex vivo slice cultures can be rescued by expressing the active form of Pak3, a downstream effector of Rac1 and Cdc42 (Node-Langlois et al., 2006). Together, these findings support a model in which the loss of the protein affects development through cytoskeletal dysregulation, likely involving altered Rho GTPase signalling. We nevertheless agree that further experiments would be required to establish a direct causal relationship between ARHGEF6 loss, Rho GTPase activity, cytoskeletal dysregulation, and the interneuron phenotypes described here. We will therefore revise the manuscript to clarify that this mechanistic link remains an interpretation supported by our data and the literature, rather than a direct demonstration within the present study.

      Also, there is insufficient evidence for interneuron specificity; although the central claim is that ARHGEF6 plays a selective role in interneurons, the data do not adequately exclude the possibility that the observed effects reflect broader neuronal defects. The study lacks critical controls across cell types, as several phenotypes observed in organoids and progenitors, including apoptosis, reduced neuronal output, and altered morphology, could also affect multiple neuronal populations without being directly tested.

      We agree that the current data do not exclude the possibility of alterations in other neuronal lineages, specifically the excitatory lineage. We would like to emphasize that the investigation of excitatory cell phenotypes was beyond the scope of the present study, as this aspect has previously been examined by Ramakers et al. (2012) and Nodé-Langlois et al. (2006), particularly in the context of hippocampal pyramidal cells, which are among the few cell types showing consistent expression of the gene in the adult mouse brain (Allen Brain Atlas; Yao et al., 2021). In this context, it is interesting to note that, in Ramakers et al. (2012; Figure S1), MAP2 immunostaining of hippocampal formations revealed comparable distribution and intensity of neuronal cell bodies and dendrites throughout the hippocampus of both wild-type and Arhgef6-KO animals. With regard to morphological maturation of excitatory cells, whereas we observe a simplification of interneuron morphology in both mouse and human models, Ramakers et al. (2012) reported increased dendritic arborization complexity in hippocampal pyramidal cells. With regard to migration, a direct comparison with excitatory neurons would be intrinsically difficult, as excitatory and inhibitory neurons undergo highly distinct migratory processes. We greatly appreciate the reviewer’s comment, as it gives us the opportunity to better discuss the relationship between our findings and previous studies in the Discussion. We will revise the manuscript and avoid implying that the phenotype observed is exclusive to interneurons.

      Furthermore, the data are predominantly descriptive, with many results remaining correlative and failing to establish causal relationships.

      We agree that our study primarily establishes a phenotypic framework and does not fully resolve the causal hierarchy among altered survival, migration, cytoskeletal morphology, and intrinsic excitability. We will revise the manuscript to make this limitation explicit, avoiding statements that imply direct causality beyond the data presented.

      Some more comments:

      (1) Given that ARHGEF6 is a guanine nucleotide exchange factor for Rac1 and Cdc42, the absence of direct measurements of GTPase activity or downstream signaling represents a significant gap. The interpretation that the observed phenotypes are mediated through specific cytoskeletal pathways, therefore, remains inferential.

      We appreciate the comment. The interpretation that our phenotype involves dysregulated cytoskeletal dynamics is based on the observed defects in actin polymerization and F-actin organization in neuronal growth cones and is consistent with the abnormalities in neurite morphology and neuronal migration. We will explicitly state in the Discussion that, since we did not directly measure Rac1 and Cdc42 activity levels in our models, our hypothesis regarding the involvement of this molecular pathway in the establishment of the observed phenotype therefore remains inferential, despite being supported by the current literature.

      (2) The manuscript repeatedly interprets the findings as interneuron-specific. However, several key observations are not demonstrated to be restricted to IN. Without direct comparison to excitatory neurons or other cell types, it is difficult to conclude that ARHGEF6 plays a selective role in interneurons rather than a more general role in neuronal development. The well-done analysis of the transcriptomic dataset is not sufficient to claim IN specificity. This issue is particularly important for the interpretation of the human organoid experiments, where reductions in SOX2⁺ progenitors and NEUN⁺ neurons, as well as increased apoptosis, could reflect global developmental defects. Similarly, in the mouse experiments, the reduction in GAD67⁺ cells is compelling, but it is not shown whether other neuronal populations are also affected.

      As previously mentioned, we understand the reviewer’s concern regarding the specificity of the observed phenotypes in interneurons and agree that the claims should be tempered. However, it is important to note that the interpretation of the human organoid experiments should be reconsidered. The use of specifically ventralized MGE-like organoids allowed us to assess the cell-autonomous nature of defects such as the reduction in inhibitory progenitors’ neuronal output, the increased apoptosis, and the morphological abnormalities of inhibitory neurons. We will acknowledge in the Discussion the limitations of the study with regard to assessing the cell-autonomous nature of the observed migration defects.

      (3) The study provides a strong phenotypic description but limited causal resolution. For example, migration defects, altered growth cone morphology, and reduced branching are all consistent with impaired cytoskeletal regulation, but the links between these phenotypes are not directly established. Likewise, while the electrophysiological data convincingly show reduced firing in interneurons, the connection between altered cytoskeletal dynamics and intrinsic excitability is not explored.

      The observed migration defects, altered growth-cone morphology, and reduced branching are consistent with impaired cytoskeletal regulation. However, we acknowledge that the mechanistic links among these phenotypes remain to be directly demonstrated. Similarly, although our electrophysiological data show reduced firing in ARHGEF6-KO interneurons, the present study does not provide direct evidence linking impaired excitability to altered cytoskeletal dynamics. In the latter case, we think that the underlying mechanisms should be further investigated at the subcellular level, particularly with respect to cytoskeleton-mediated intracellular trafficking and localization and distribution of ion channels. One limitation of the present study, which may have masked electrophysiological alterations associated with differences in membrane composition (current Figure S1D–H), is that different interneuron subtypes with distinct intrinsic properties were pooled together in the analysis. We will expand the Discussion to address these limitations.

      (4) Several aspects of data presentation could be improved. In multiple figures (e.g., Figure 1A, D; Figure 4 and Video S1, 2), the images are difficult to interpret due to high cellular density, limited magnification, or lack of clear annotation. In some cases, it is not fully clear how quantifications were performed or which regions were analyzed. Improving the visual clarity with arrows, boxes, and high-magnification inserts of the data would strengthen confidence in the conclusions.

      We would like to thank the reviewer for pointing this out. We agree that some images and videos would benefit from clearer annotation. In the revised manuscript, we will add high-magnification insets, arrows or boxes highlighting the relevant regions/cells, and clearer descriptions of the quantified regions. We will also improve legends and video labels to indicate genotype, region, and tracked cells.

      Reviewer #2 (Public review):

      The authors investigate the impact of the deletion of the small GTPase regulator ARHGEF6 on the development and physiology of interneurons. Using public databases, they first show that ARHGEF6 is enriched in interneurons or in areas that give rise to them, both in development and adulthood, in humans and mice. Using a previously reported complete KO mouse and a GAD67-GFP reporter mouse line, they show that in the adult mouse cortex and hippocampus, there is a notable reduction in GFP+ cells. These mice show increased apoptotic cells at different timepoints and areas of the brain during development. In the developing cortex of ARHGEF6-KO mice, there are fewer IN in all layers, and cells present processes that are not correctly oriented. IN from the hippocampus in culture show reduced excitability and impaired neurite branching. The authors then established isogenic hiPSC lines to study ARHGEF6 deletion in human cells and differentiated ventral forebrain neurons, finding interneuron-related and non-related phenotypes. Most importantly, human interneurons grown in organoids show reduced branching and altered growth cone morphology. The authors claim that the novel interneuron phenotypes found in these models can explain, in part, the human intellectual disabilities associated with mutations in this protein. The study is well conducted and opens new avenues of research not only for the role of small GTPase regulation in early nervous system development, but also for how interneuron deficiencies impact a wider range of intellectual disability syndromes found in humans.

      We appreciate the reviewer’s positive evaluation of our manuscript and their recognition of this work’s potential to expand the focus of intellectual disability research on the development and function of the inhibitory system. We are particularly encouraged that the reviewer highlights the strength of our combined mouse and human cellular models, as well as the relevance of the interneuron-related phenotypes we identify across systems.

      However, most conclusions of the present version would be strengthened after considering the following comments:

      Major comments:

      (1) The reported biological processes evaluated at different developmental stages may be directly or indirectly related to ARHGEF6 function itself. As a model of a hereditary disease, full organism gene deletion is valid, since the human patients suffer from that condition as well. However, to investigate the roles of a protein, complete deletions may not be very accurate since they can give rise to phenotypes that are only indirectly related to the protein function itself. Most conclusions of the present manuscript should either be discussed in this regard or be supported by added evidence for a direct role of the protein. Such evidence is typically obtained with acute knockdowns in culture, or in developing brains by in utero electroporation. For example, Figure 1C shows that the principal excitatory neurons in the hippocampus do not express ARHGEF6. However, most electrophysiological and behavioral evidence of defects in ARHGEF6-KO mice arises from evaluating these cells (Ramakers et al., 2012). I am not suggesting that either the previous or the current evidence is wrong. But I believe readers would benefit from a clear distinction (or added caution notes) between a functional consequence of the deletion (that can be months away and in other cells than the actual molecular defect) and a true cell biological function of the protein under study. In favor of the authors, this is a concern with most conclusions derived from KO organisms.

      We agree with the reviewer that phenotypes observed in constitutive knockout models may, in some contexts, reflect indirect or compensatory consequences of long-term gene loss. Conditional and/or inducible knockout or knockdown approaches can certainly help dissect the nature of the observed defects and better define the effects of gene ablation at different developmental stages or in specific cell types. However, in the context of our study, it is important to note that the experiments performed in ventralized MGE-like organoids allowed us to assess the cell-autonomous nature of very early developmental defects in the inhibitory lineage, in isolation from other cell types. These defects include reduced neuronal output from inhibitory progenitors, increased apoptosis, and morphological abnormalities in inhibitory neurons. Therefore, the phenotypes reported here are less likely to reflect effects originating in, or indirectly caused by, cell types that do not express Arhgef6.

      With regard to Figure 1C, we state in the Results that “among excitatory populations, only CA3 pyramidal neurons and mossy cells exhibited expression levels comparable to those observed in inhibitory clusters (Figure 1D, Table S2),” thereby not neglecting the potential effect of the lack of a functional protein in these populations.

      (2) Figure 1E-G H I. All conclusions are made with a GAD67-GFP reporter, which is a very powerful and reliable tool for large-scale screening. All the conclusions of the paper would be strengthened if some immunohistochemical staining in the same areas of specific markers for interneurons would be added as supporting complementary evidence.

      We appreciate the insightful comment of the reviewer. Additional validation using established interneuronal markers will further strengthen the GAD67-eGFP analysis. We will perform complementary stainings (e.g., PVALB and CCK) and quantifications and include these data as a Supplementary Figure.

      (3) Cell death in development: It is surprising that the high amount of TUNEL staining during development does not translate into gross histological changes in the adult brain (studied elsewhere). Can authors discuss possible explanations?

      We appreciate the thoughtful consideration of our findings. We think that possible explanations include partial compensatory mechanisms during development, which may mitigate the long-term anatomical consequences of increased cell death. In addition, the phenotype may be restricted to specific neuronal populations or developmental windows, thereby producing functional alterations without necessarily resulting in overt macroanatomical defects. Thus, although increased developmental cell death may contribute to altered circuit assembly and neuronal output, it may not be sufficient to produce gross histological changes detectable at the adult brain level.

      (4) Section 4 (Figures 2F-J) - The authors present this staining as an analysis of migration. Normally, migration studies are performed with a "pulse-chase" paradigm, where a single cohort is labeled and then followed over time (normally by in utero electroporation of a fluorescent protein). Tissue is then fixed at different time points, and migration can be followed. On the contrary, the evidence is from a single point, in an experimental setting in which all Gad67 IN are stained, and hence, one cannot imply a defect in migration. The differences between WT and ARHGEF6-KO are obvious and interesting; it is just that they cannot be solely attributed to a problem in migration.

      Also, a true phenotype of migration in the current setting should have found that the cells that failed to migrate are accumulated in deeper layers. My impression is that the changes in IN per layer are easier explained by total cell number, rather than migration. Perhaps evaluating earlier timepoints could clarify this.

      We appreciate the reviewer’s suggestion to implement an additional time point in the in vivo migration analysis. Since an earlier in vivo time point would most likely not reveal migration-related defects, as most cells would still be confined to the ganglionic eminence (Liaci et al., 2022), we will include analyses performed at a later developmental time point as supplementary evidence. We will also revise the wording to clarify that the fixed-tissue data show altered distribution and orientation of GAD67-eGFP-positive interneurons, which are consistent with impaired migratory behavior when considered together with the in vitro live-imaging data. At the same time, we will acknowledge that reduced interneuron survival and/or neuronal output may also contribute to the observed phenotype.

      (5) It is known that ARHGEF6 deletion produces severe F-actin phenotypes in neurons. Have the authors confirmed that GAD67 cells in their hippocampal cultures also have these phenotypes, such as stress fibers in somas, growth cones, and actin patches along neurites?

      We did not directly assess F-actin organization in GAD67-eGFP murine primary cultures. Direct analyses of F-actin organization, growth-cone morphology, and cytoskeletal organization were performed only in the human system. To further assess this phenotype, we will perform phalloidin staining on GAD67-eGFP brain sections to evaluate F-actin organization in interneurons in vivo.

      (6) Section 4. The authors present data for deficient migration of the GFP-labeled interneurons. Is it possible to assess, in the same sections, whether other cell types are also affected? Although the hypothesis that ARHGEF6 deletion will have an impact in IN is well rooted in expression data, by assessing other cell types, one can even include a positive control or evidence for a cell-autonomous phenotype.

      We thank the reviewer for their thoughtful suggestions. We agree that extending the analysis to additional cell types would provide further insight into the specificity of the phenotype; however, a comprehensive evaluation of all neuronal populations falls beyond the scope of this research. The use of ventralized MGE-like organoids enabled us to examine whether key defects were cell-autonomous, including the reduced neuronal output of inhibitory progenitors, increased apoptosis, and abnormal inhibitory-neuron morphology.

      (7) ARHGEF6 deletion has an important impact on organoid development (size, shape, etc). Have the authors analysed whether these organoids produced fewer interneurons?

      We would like to clarify that the organoids analyzed in the study are ventral MGE-like organoids and therefore the reduction in neuronal output (current Figure 4K) primarily reflects the ventral/interneuron lineage in this model.

      (8) In assembloids, the differences in migration parameters are very small between WT and ARHGEF6-KO, which reinforces that perhaps what is observed in the different layers of cortex during mouse development is likely not entirely due to migration, as concluded.

      We agree that the migration parameters in assembloids should not be interpreted in isolation. We will revise the text to emphasize that the reduction in the number of interneurons observed in the adult brains is part of a broader pattern that also includes altered neuronal output and reduced viability.

      (9) To properly weigh the present evidence -interneuron deficits- using the ARHGEF6-KO model, authors should include a deeper discussion in light of much work that has been done using these mice. How does the finding of a diminished IN population in the brain of these mice explain the large amount of electrophysiological and behavioral evidence produced before with these animals? Perhaps the most important work to discuss these aspects is the initial ARHGEF6-KO report by Ramakers and colleagues (2012), but there are others.

      We appreciate the reviewer’s emphasis on the importance of framing our findings within the broader context of the existing literature. We will expand the Discussion to better integrate previous work on ARHGEF6-KO mice. Specifically, we will discuss how reduced interneuron number and altered interneuronal function may contribute to previously reported electrophysiological and behavioral phenotypes, acting in concert with previously described alterations in excitatory neurons and synaptic plasticity (Ramakers et al., 2012).

      Minor comments:

      (1) Figure 1A. It looks clear that the GE shows the highest expression of ARHGEF6; however, the reader needs the reference levels where the log2 expression is calculated. What are the reference levels?

      We would like to thank the reviewer for pointing this out. We will clarify in the caption that the log2(RPKM+1) expression values are shown as absolute values and are not relative to a reference condition.

      (2) Have the authors compared the number of GAD67-eGFP cells in the hippocampal cultures between WT and ARHGEF6-KO mice?

      We did not rely on total GAD67-eGFP counts in dissociated hippocampal cultures because differences could reflect initial plating composition, survival, and maturation. In our experience, the MGE-like organoid system provides a more controlled in vitro context to assess neuronal output in the ventral lineage.

      (3) Section 3, as a caution note, authors should mention that it is not possible to know from the evidence provided which cells are dying.

      We agree with the reviewer and will add a cautionary statement noting that TUNEL staining alone does not identify the precise dying cell type. We will clarify that increased cell death in the ganglionic eminence and MGE-like organoids is consistent with a prominent involvement of the ventral/inhibitory lineage, while acknowledging the limits of the assay.

      (4) In the dorsal-ventral assembloids, it is expected that the ventral organoid would contain lots of GFP expression compared to the dorsal, but in the image shown (Figure 5A) both parts of the assembloid seem to have the same amount and distribution of GFP. How is that possible?

      We appreciate the thoughtful comment of the reviewer. After two weeks of fusion, a considerable number of interneurons are expected to have migrated from the ventral to the dorsal compartment of the assembloid (Birey et al., 2017; Sloan et al., 2018). In terms of distribution, we think that current Figure 5A shows a gradient of eGFP-positive cells within the dorsal compartment, with the number of labeled cells decreasing as the distance from the fusion interface between the two organoids increases. By contrast, a comparable gradient is not evident in the ventral compartment, where several labeled neurons remain present even in regions distal to the fusion site.

      Reviewer #3 (Public review):

      Summary:

      ARHGEF6 is a RAC1/CDC42 guanine nucleotide exchange factor that has been proposed to be associated with X-linked intellectual disability, but its relevance to the pathology is not well established. ARHGEF6 has been assigned a role in spine density and plasticity of hippocampal pyramidal neurons, but nothing is known about its role in interneuron development. Here, the authors show that ARHGEF6 is expressed early in development in the inhibitory lineage during the peak of interneuron generation and migration. The aim of the study is therefore to investigate whether, in addition to its role in pyramidal neurons, ARHGEF6 could play a role in inhibitory neuron development. Using both ARHGEF6-KO mice and organoids from ARHGEF6-KO hiPSCs, the authors show that ARHGEF6 plays a critical role in interneuron development and function.

      Strengths:

      The major strength of the paper is the very detailed analysis of the role of ARHGEF6 using two different systems: ARHGEF6-KO mice and deletion of ARHGEF6 in human iPSC-derived organoids. Strikingly, deletion of ARHGEF6 in both systems induces similar defects such as an increase in apoptosis, reduced neuronal output, impaired neuronal morphology, and disrupted migratory dynamics. This compelling evidence demonstrates that ARHGEF6, in addition to its already well-described role in spine formation and plasticity, is playing a crucial role during embryonic development through its function in interneurons.

      We thank the reviewer for this positive assessment of our work and for highlighting the strength of our combined in vivo and human iPSC-derived organoid approaches. We are pleased that the reviewer recognizes the consistency of the phenotypes observed across both systems and acknowledges that our findings support a crucial role, during early stages of embryonic development, for a protein previously thought to be relevant primarily in the synaptic context.

      Weaknesses:

      (1) In Figure 1, the authors show that ARHGEF6 is expressed in different regions of the brain, including the interneuron lineage, and that depletion of ARHGEF6 reduces the number of GABAergic neurons in the adult cortex and hippocampus. To better characterize this defect, the authors in Figure 2 investigate whether deletion of ARHGEF6 affects interneuron migration and survival during embryonic development. To do so, ARHGEF6-KO mice were crossed with the GAD67-eGFP reporter line to follow the inhibitory lineage. The authors analyse apoptosis using TUNEL staining, and show that it is significantly increased in the ganglionic eminence of ARHGEF6-KO E14.5 embryos. The authors claim that this is not the case in the cortex. However, the image shown in Figure 2A really suggests that staining is increased. Which part of the neocortex is analysed for quantification? This should be clarified.

      We would like to thank the reviewer for pointing this out. The region analyzed was the same as that used to assess GAD67-eGFP-positive cells in Figure 2F. We will clarify the exact neocortical region used for TUNEL quantification and revise the figure and legend to make the analyzed area explicit. We will also analyze additional animals to improve the accuracy of the analysis.

      (2) In Figure 2F-J, the authors investigate the migration of interneurons by analysing the GAD67-eGFP staining, and clearly show that the migratory abilities of the depleted neurons are reduced. However, the authors do not discuss the fact that, because depletion of ARHGEF6 increases apoptosis, there are fewer neurons available for migration. This is important for the interpretation of the data. This point should be clarified.

      We appreciate this comment and believe that it is particularly relevant to the interpretation of the data shown in Figure 2F–G. We will clarify the limited interpretation of this specific analysis in the Results section. The altered directionality observed in vivo, together with evidence of impaired migratory behavior obtained through in vitro live imaging, supports the possibility that altered migratory dynamics contribute to the phenotype, although increased apoptosis and reduced neuronal output may also contribute.

      (3) In Supplementary Figure S2, the authors describe the establishment of the ARHGEF6-KO human iPSC line and test the ability of these cells to undergo correct development, especially for the generation of neural progenitor cells. I was wondering why the authors do not present the data of both control and ARHGEF6-KO cells.

      We thank the reviewer for pointing this out. All staining reported in the organoids and assembloids in this paper shows that the WT ATCC-DYS0100 cell line, as well as the mutant, efficiently differentiates into neuronal tissue. The Supplementary Figure was intended to validate the impact of the mutation on the ability of the iPSC line to retain its differentiation capacity as a preliminary step before proceeding with organoid differentiation. We will integrate stainings for NPC markers on the WT line in the Supplementary Figure.

      (4) At the molecular level, how ARHGEF6 depletion could affect neuronal survival is missing. In addition, as ARHGEF6 is a GEF for RAC1 and Cdc42 amongst other GEFs, I would have expected that the authors test how RAC1 activity (and Cdc42) is affected in ARHGEF6-depleted brains and in ARHGEF6-KO organoids. The measure of phalloidin staining and the anisotropy index are not really meaningful.

      We appreciate the thoughtful comment of the reviewer. Previous evidence already shows that Arhgef6 loss reduces the activity of both GTPases and deregulates the expression of the cytoskeletal regulators Pak1–3, Limk1, and Cofilin in the mouse brain (Ramakers et al., 2012). Regarding organoids, we agree that direct RAC1/CDC42 activity measurements would have strengthened the molecular mechanism. We will revise the manuscript to avoid implying that our phalloidin-based measurements alone establish the underlying dysregulated molecular pathway.

      (5) The authors show that ARHGEF6-KO forebrain organoids were markedly smaller compared to their isogenic controls, and their study suggests that ARHGEF6 expression impacts progenitor maintenance and neurogenesis. Despite representing only a minority of the total neuronal population, I was wondering whether ARHGEF6-KO mice present brain morphology defects such as microcephaly.

      We appreciate the comment. We did not perform a morphometric analysis for microcephaly in the present study. We will add this limitation to the Discussion and note that gross brain morphology changes were not reported in the previously published ARHGEF6-KO mouse characterization (Ramakers et al., 2012). We will also clarify that the smaller organoid phenotype may reflect developmental defects that are not fully compensated in a reductionist in vitro model and therefore do not necessarily imply overt microcephaly in vivo.

      References

      Allen Institute for Brain Science. Allen Mouse Brain Atlas: Arhgef6 ISH data. Available from: Allen Brain Map.

      Birey, F., Andersen, J., Makinson, C. D., Islam, S., Wei, W., Huber, N., Fan, H. C., Metzler, K. R. C., Panagiotakos, G., Thom, N., O’Rourke, N. A., Steinmetz, L. M., Bernstein, J. A., Hallmayer, J., Huguenard, J. R., & Pașca, S. P. (2017). Assembly of functionally integrated human forebrain spheroids. Nature, 545(7652), 54–59. https://doi.org/10.1038/nature22330

      Liaci, C., Camera, M., Zamboni, V., Sarò, G., Ammoni, A., Parmigiani, E., Ponzoni, L., Hidisoglu, E., Chiantia, G., Marcantoni, A., Giustetto, M., Tomagra, G., Carabelli, V., Torelli, F., Sala, M., Yanagawa, Y., Obata, K., Hirsch, E., & Merlo, G. R. (2022). Loss of ARHGAP15 affects the directional control of migrating interneurons in the embryonic cortex and increases susceptibility to epilepsy. Frontiers in Cell and Developmental Biology, 10, 875468. https://doi.org/10.3389/fcell.2022.875468

      Nodé-Langlois, R., Muller, D., & Boda, B. (2006). Sequential implication of the mental retardation proteins ARHGEF6 and PAK3 in spine morphogenesis. Journal of Cell Science, 119(23), 4986–4993. https://doi.org/10.1242/jcs.03273

      Pelkey, K. A., Chittajallu, R., Craig, M. T., Tricoire, L., Wester, J. C., & McBain, C. J. (2017). Hippocampal GABAergic inhibitory interneurons. Physiological Reviews, 97(4), 1619–1747. https://doi.org/10.1152/physrev.00007.2017

      Ramakers, G. J. A., Wolfer, D., Rosenberger, G., Kuchenbecker, K., Kreienkamp, H.-J., Prange-Kiel, J., Rune, G., Richter, K., Langnaese, K., Masneuf, S., Bösl, M. R., Fischer, K.-D., Krugers, H. J., Lipp, H.-P., van Galen, E., & Kutsche, K. (2012). Dysregulation of Rho GTPases in the αPix/Arhgef6 mouse model of X-linked intellectual disability is paralleled by impaired structural and synaptic plasticity and cognitive deficits. Human Molecular Genetics, 21(2), 268–286. https://doi.org/10.1093/hmg/ddr457

      Sloan, S. A., Andersen, J., Pașca, A. M., Birey, F., & Pașca, S. P. (2018). Generation and assembly of human brain region-specific three-dimensional cultures. Nature Protocols, 13(9), 2062–2085. https://doi.org/10.1038/s41596-018-0032-7

      Yao, Z., Nguyen, T. N., van Velthoven, C. T. J., Goldy, J., Sedeno-Cortes, A. E., Baftizadeh, F., Bertagnolli, D., Casper, T., Chiang, M., Crichton, K., Ding, S.-L., Fong, O., Garren, E., Glandon, A., Gouwens, N. W., Gray, J., Graybuck, L. T., Hawrylycz, M. J., Hirschstein, D., … Zeng, H. (2021). A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation. Cell, 184(12), 3222–3241.e26. https://doi.org/10.1016/j.cell.2021.04.021

    1. eLife Assessment

      This important study examines the effects of diet and exercise on brain structure and behaviour in the 3xTg mouse model of Alzheimer's disease. They show that combined access to a low-fat diet and exercise improves regional brain volume and behaviour in transgenic and wild-type control mice in a sex-specific manner, with analyses linking functional improvements to glucose homeostasis. Although some claims are well supported, the overall strength of the evidence is incomplete and hampered by a lack of clarity regarding the statistical analyses chosen. The work may be of interest to researchers studying neurodegenerative disease, particularly in preclinical contexts.

    2. Reviewer #1 (Public review):

      A triple-transgenic (3xTgAD) mouse model of Alzheimer's disease was exposed to a high-fat diet and assigned to one of three interventions: voluntary physical activity, a low-fat diet, and their combination. A high-fat diet significantly increased body weight and induced widespread neuroanatomical changes, with effects modulated by sex and genotype. The combined intervention led to significant weight loss in males of both genotypes. Neuroanatomical analyses revealed that a high-fat diet significantly reduced hippocampal and cerebellar volumes in wild-type mice but had a less pronounced effect on 3xTgAD mice; nevertheless, interventions, particularly the combined approach, increased localized brain volumes in these regions regardless of genotype. Spatial gene enrichment analysis of this pattern identified glucose homeostasis. Overall, these findings suggest that voluntary physical activity and a low-fat diet can modulate brain structure and behaviour, partially counteracting the effects of a high-fat diet, and potentially recruiting biological processes that may support brain health.

      The authors describe studies of the 3xTg mouse model of Alzheimer's disease (AD). They set out to study the interactions of diet and exercise on three outcomes: weight gain, MRI, and either the novel object recognition or Morris water maze tasks of memory.

      They conclude there are sex and genotype effects on hippocampal volume.

      There are several strengths to the study. First, they start out with a large number of mice. Once the mice are divided into groups, however, the sample sizes are not always large. It would be good to know that the study was sufficiently powered.
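
      One way to address the power question is an a priori sample-size estimate. The sketch below uses the normal approximation for a two-sided comparison of two group means. The effect sizes (Cohen's d), alpha, and power are illustrative assumptions, not values from the study, and a multi-factor design with sex, genotype, diet, and exercise would need a more elaborate calculation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals needed per group for a two-sided, two-group comparison.

    effect_size is Cohen's d (an assumed value, not one reported by the study).
    The normal approximation slightly underestimates the requirement of a
    t-test when groups are small.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical effect sizes, chosen only to show how quickly n grows as d shrinks.
for d in (0.5, 0.8, 1.2):
    print(f"d = {d}: about {n_per_group(d)} mice per group")
```

      Comparing such estimates with the final per-group sample sizes would show which contrasts are adequately powered.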

      The data are also interesting. Mice were placed on several different diets during the study, which will be of interest to many who question the role of diet in outcomes. They also add exercise as an intervention, and study not only diet but also the combined effect of diet and exercise. This is relevant to those interested in controlling dementia by diet and exercise. Finally, they perform some very interesting analyses to study the data.

      That said, the study also has several limitations. For example, it is quite complex. Mice had a standard diet until 2 months of age, then were switched to either a low-fat or a high-fat diet. Some mice had both a different diet and exercise. MRI was performed at 2, 4, and 6 months, when behavior was tested. A drawback of this design is that no assessment of outcomes relevant to this animal model, such as amyloid-beta or tau phosphorylation, was conducted. Also, they used the novel object recognition task, despite stating in the Discussion that this task does not show impairments until well after 6 months of age. They added exercise, but it is not clear whether the animals used the exercise apparatus equally. Also, the animals were housed "communally", so adding an exercise wheel may have made the cage crowded, adding stress to the study. The diets were not simply low- or high-fat because many constituents besides fat content also changed. Regarding fat, the type of fat also changed between diets. Therefore, the gut microbiome was probably affected differently by factors other than fat intake. There was no measurement of food consumption, so some mice may not have eaten as much of the new diet as they did of the old diet they were used to.

      Regarding the data, only the outcomes of complex analyses are shown. One would first want to see the changes in body weight and perhaps later how it is analyzed in a more complex way. For behavior, one would first want to see outcomes as typically presented. For example, learning, recall, platform test results from the Morris water maze, and discrimination indices for object recognition. Note that, at one point, I believe the authors note that some groups did not explore thoroughly, which would make novel object recognition hard to interpret. If there was any difficulty with ambulation, both tasks would be hard to interpret.
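
      For reference, the discrimination index for object recognition is commonly computed from the exploration times of the two objects. The sketch below is a minimal illustration, not the authors' pipeline. The function name and the minimum-exploration threshold are hypothetical, and the threshold value would depend on the protocol.

```python
def discrimination_index(novel_s, familiar_s, min_total_s=20.0):
    """Return (novel - familiar) / (novel + familiar), or None if exploration is too low.

    novel_s and familiar_s are exploration times in seconds.
    min_total_s is a hypothetical exclusion threshold: animals that barely
    explore either object yield an unstable index and are flagged instead.
    """
    total = novel_s + familiar_s
    if total <= 0 or total < min_total_s:
        return None
    return (novel_s - familiar_s) / total


# Illustrative values only (not data from the study).
print(discrimination_index(30.0, 10.0))  # clear preference for the novel object
print(discrimination_index(3.0, 2.0))    # too little exploration to interpret
```

      Reporting the index together with total exploration time makes it clear which animals, if any, explored too little for the task to be interpretable.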

      Regarding MRI, from what can be seen, structures cannot be distinguished clearly. At least some raw data should be shown to demonstrate this and to determine what the data show. The raw data suggest that some of the larger structures can be distinguished, and we should see the data for these areas, even if all areas can't be assessed. In the end, the authors focus primarily on the hippocampus and discuss the cerebellum, but it seems that changes occur throughout the brain. The choice to focus on the hippocampus and cerebellum needs to be supported.

      To gain further insight, the authors analyze genes across different brain regions using the Allen Brain Atlas. Although this seems reasonable in theory, once one realizes how many genes are shared across diverse brain regions, one wonders how such an analysis was conducted. More understanding of this approach, as well as how it was validated, is important. In the end, the authors conclude that the glucose homeostatic pathways were primarily altered, and one would like to understand whether that is indeed true and whether it is the only set of pathways that were changed.

      This raises another point: what occurs in a normal wild-type mouse on the standard diet during the first 6 months of life? Do the glucose homeostatic pathways change simply due to age? Sex? It may be that the mice become more sedentary with age, which could explain such changes. Once that is resolved, what occurs on the standard diet for the 3xTg mice? Perhaps they are more active or more sedentary, regardless of diet or exercise? Thus, the studies end up raising more questions than answers.

      Given so much work has already been done, it seems best to simply reorganize the presentation with raw data first, followed by the analysis. For the second section, the implicit assumptions of the analyses should be very clear so that the analyzed data are understood and believable. Limitations of the assumptions, pooling some groups, etc., need to be clear.

      Figures. In Figure 1, the weekly measurements are not shown. The points are connected, so an unbroken line is shown. Around the line are lighter lines indicating errors, but with all the lines and colours, one does not know what standard errors surround the values for any given group. This makes the data hard to interpret. In later figures, significant differences are indicated with asterisks, but this seems to be done inconsistently.

      In the text, more caution is needed for some assertions. For example, it is not clear that a 2- to 6-month-old is an adolescent. Opinions about the ages of mice that correspond to human life stages have always been debated. Another example is the suggestion that male mice gain weight differently than females as an outcome of diet or exercise. Male rodents continue to gain weight in adulthood, whereas females stabilize because estrogen limits appetite, so a sex difference would be expected regardless of the intervention. Additionally, females may not show group differences because they are more variable. This can relate to their estrous cycle. If stressed or housed without males nearby, they may not have a regular estrous cycle, which can then affect their outcomes. This may be particularly true for behavior when they may have been tested during different estrous cycle phases, if they had estrous cycles.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript describes an investigation into the effect of diet and exercise interventions in WT and transgenic (male and female) mice who are exposed to either a high-fat or a low-fat diet. The outcome variables include MRI volume and brain morphology, as well as memory performance. First, this study measured the impact of genotype (WT vs 3xTgAD mice), then examined the impact of a high-fat or low-fat diet in each group, and finally examined the impact of a low-fat diet, exercise, or a combined low-fat diet and exercise intervention. This is an important study as it allows us to better understand how changes to lifestyle can affect neurocognitive function and potentially change a person's AD risk.

      Strengths:

      (1) The study uses a well-controlled longitudinal design, allowing the authors to track how diet and exercise interventions influence brain and behaviour over time.

      (2) The integration of multiple levels of analysis (brain imaging, behaviour, and multivariate modelling) provides a rich and comprehensive assessment of intervention effects.

      (3) The inclusion of both genotype and sex as key variables strengthens the relevance and interpretability of the findings, given known differences in risk and response across groups.

      Weaknesses:

      There are a lot of analyses in this paper, and I had a little bit of trouble distilling the major take-home messages. For example, I was left wondering:

      (1) If the effect of genotype and the effect of the high-fat diet were consistent in the current study compared to the authors' previous work (e.g. Rollins et al., 2019). A more direct report on the consistency of these findings (maybe even an overlap map, if possible) would benefit the reader.

      (2) How consistent/different are the volumetric and morphometric (DBM) results from each other? Especially in the regions of interest (hippocampus and cerebellum), are increases in volumes always related to "expansion" of a given region using DBM? Some of the similarities are reported in the results, but for transparency, a side-by-side table comparing the results across techniques for each effect of interest might provide more clarity.

      (3) I was interested in the Partial Least Squares approach that the authors used to investigate how patterns of brain measures relate to the behavioral variables. Because they are presented mostly in the supplement (except for Figure 6E), it's difficult to map the LVs described onto the univariate contrasts in Figures 2-5. In general, greater clarity is needed regarding how the PLS-derived latent variables relate to the univariate findings, and whether the emphasis on LV3 reflects a principled selection or post hoc interpretation.

      (4) If I understand the results correctly, there were only modest differences in behavior reported, and the patterns were somewhat inconsistent across sex and genotype. In fact, the authors report that the high-fat diet alone did not impair memory on the Morris Water maze (line 323). The discrepancy between robust neuroanatomical effects and relatively modest behavioural changes raises important questions about the functional significance of the observed structural alterations.

      (5) On line 507, the authors state, "Notably, 3xTgAD mice already show smaller brain volumes at baseline, which may constrain the detectable impact of the diet." Is this true for the entire brain or just the hippocampus and cerebellum? Would a global reduction in brain volume due to the 3xTgAD AD model affect the interpretation of the intervention effects?

    4. Reviewer #3 (Public review):

      Summary:

      The authors sought to determine the individual and combined effects of exercise and low-fat diet consumption on regional brain volume and cognitive function in triple-transgenic Alzheimer's disease mice and wild-type controls.

      Strengths:

      (1) A strength of this study is its longitudinal design, which captures regional changes in brain volume across the interventions tested.

      (2) Its comprehensive design includes 10 groups and is well-powered to isolate genotype-, sex-, diet- and exercise-related effects (and interactions).

      (3) The analyses of volumetric and voxel-based measures are comprehensive.

      Weaknesses:

      (1) Use of automated tracking for NOR data reduces confidence in the behavioural data.

      (2) No measures of Aβ or tau pathology appear to have been performed.

      (3) Mice from the critical 'combined' intervention groups are not included in the PLS regression model that integrates behavioural and brain data.

      (4) Analyses of behavioural data include a large number of variables without adequate justification.

    1. eLife Assessment

      The manuscript by Rotsides et al. reports the design and validation of SMART-MR1, a miniaturized MR1 metabolite-display platform in which the α1/α2 ligand-binding domain is stabilized by a synthetic helical domain in place of the α3 domain and β2-microglobulin. Supported by biochemical, biophysical, and structural approaches, including ITC, NMR, and cryo-EM, the work provides solid evidence that SMART-MR1 retains native-like ligand binding and A-F7 TCR recognition while enabling experimental approaches for ligand screening that are difficult with conventional MR1 constructs. The study is valuable for the MR1 and MAIT-cell fields, particularly as a tool for ligand screening and mechanistic studies of MR1-restricted antigen presentation. There are several suggestions to further strengthen the study's impact, including clearer benchmarking against existing MR1 platforms, broader validation across ligands and TCRs, and functional evidence from MAIT-cell staining or activation assays.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents an important tool for the study of MR1 antigen binding, opening new possibilities for applying cutting-edge techniques. The evidence supporting the claims of the authors is solid, although including some functional experiments using primary T cells would provide a more complete physiological evaluation. The work will be of interest to T cell immunologists in general, especially those studying unconventional T cells.

      Strengths:

      In this study, the authors developed a single-chain MR1-derived protein by exchanging the α3 domain and β2-microglobulin for a helical stabilizing domain that they had previously developed. The aim was to generate a more compact structure that would still fold properly, without the risk of losing β2-microglobulin. This overall more robust structure would facilitate ligand exploration using various cutting-edge biophysical techniques.

      The authors successfully demonstrated that their construct folds similarly to native MR1 and retains the ability to bind MAIT TCR in solution, as shown by cryo-EM experiments. Its melting temperature was equivalent to that of the native protein. Importantly, the construct enables the use of differential scanning fluorometry and transverse relaxation-optimized spectroscopy, which represent the main strengths of this work. These approaches should greatly facilitate the screening of additional unknown ligands and enable interaction mapping.

      Weaknesses:

      One possible area for improvement would be to extend the validation to additional known ligands, particularly weaker binders. Furthermore, although the cryo-EM data are highly convincing, including either MAIT cell staining or MAIT activation assays with the generated construct would provide stronger functional validation of its equivalence to the wild-type protein with respect to ligand-binding properties.

      Overall, this work is of great interest to the field, as several groups worldwide are seeking to identify endogenous/tumour-derived MR1 ligands. In addition, some pathogens lacking the capacity to produce 5-OP-RU have been shown to activate MAIT cells, raising the possibility that unknown pathogen-derived ligands may also exist.

    3. Reviewer #2 (Public review):

      Summary:

      The authors develop a miniaturized MR1 construct (SMART-MR1) in which the α1/α2 platform is stabilized by a synthetic domain, and show that it can bind ligands, engage a cognate TCR, and recapitulate native-like recognition by cryo-EM.

      Strengths:

      The work is well-written, technically strong and carefully executed. The authors combine biochemical, biophysical and structural approaches, including ITC, NMR and cryo-EM, to show that SMART-MR1 behaves in a manner closely resembling native MR1. The reduction in size and the demonstration of solution NMR are clear practical advantages for certain types of mechanistic studies.

      Weaknesses:

      The main limitation is that the manuscript does not clearly establish a practical advantage over existing MR1 formats, such as single-chain MR1-β2M or previously described stabilized constructs. The comparison is largely framed against native MR1, which risks overstating the problem, and on the basis of the data presented, it is unlikely that other researchers will adopt this system. In addition, the choice of the A-F7 TCR as a validation reagent may overestimate the generality of the approach, as this receptor is known to exhibit relatively broad ligand tolerance, including recognition of MR1 presenting vitamin B6 metabolites (PDB 9CGR) and structurally diverse synthetic ligands. The extent to which SMART-MR1 supports recognition by a broader range of MR1-restricted TCRs is not addressed.

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript describes the engineering, production and validation of an MR1 variant with enhanced suitability for screening of ligands and biophysical and structural analysis. The authors utilize a previous advance from their laboratory on a classical MHC (HLA-A2) whereby the α3 and β2m domains are replaced by a helical stabilizing domain.

      Strengths:

      This variant has a smaller molecular weight than the native MR1, can be produced easily through refolding and is thus much more suitable for NMR analysis. The authors provide data demonstrating that many of the parameters typically evaluated in protein biochemistry/biophysics are similar to reported values between this engineered variant and the wild-type protein. Overall, this is a significant advance to the MR1 field and more broadly to MR1 relevance in immunology and cancer biology, as this will accelerate high-throughput screening and discovery of disease-relevant ligands for MR1, which have been overshadowed by the misguided fixation on 5-OP-RU.

      Weaknesses:

      Minor concerns about the lack of comparison with the native MR1 extracellular domain construct in the validation of this engineered construct.

    1. eLife Assessment

      The study by Izquierdo and colleagues provides important insights into the field of genomic and transcriptomic prediction of traits across multiple environments. The rationale and analyses conducted to integrate the two types of ~omics datasets across two environments are solid. However, some clarification would be appreciated in the presentation of the results, and adding some statistical control to clarify how the predictors were selected, or assessing their importance using the SHAP framework, would further consolidate the findings.

    2. Reviewer #1 (Public review):

      Summary:

      P. Izquierdo et al. investigated the genetic determinism of various traits of interest in switchgrass using large-scale genomic and transcriptomic data. More specifically, they worked on a diversity panel comprising 426 genotypes evaluated in common-garden experiments at two locations (Michigan and Texas). The phenotypic and genomic data were already published. In this work, they produced transcriptomic data for each of the 426 genotypes at each site, and they carried out phenotype predictions using genomic and transcriptomic data separately or together. Although the two data types were moderately correlated at each location, they appeared to provide complementary information for phenotype prediction. To further exploit the fact that they have data across two locations, they computed differences for phenotypes and transcripts between locations as indicators of trait and transcript plasticity, respectively. They built predictive models of trait plasticity using genomic information and transcript plasticity, which proved to be quite accurate for traits affected by GxE. Finally, they made use of SHAP values from predictive models of flowering time and biomass at each location, as well as for their plasticity, to gain insight into their genetic determinism. These SHAP values provide the importance of the predictive features (SNP and/or transcripts) for trait prediction. This allowed them to confirm some candidate genes and to propose new candidates for both traits.

      Strengths:

      I found this study interesting and rich. I think the sample size (426 genotypes) is large enough to support the findings. The use of a modern machine-learning approach (XGBoost) together with SHAP indices to find interesting features and get insights into the biological mechanisms underlying flowering time and biomass production is quite original. The methodology employed is globally sound. I also like the fact that the authors accounted implicitly for the population structure by providing a baseline prediction using the first 5 PCs.

      Weaknesses:

      While the methodology is globally sound, I sometimes had difficulties following exactly what was done. This is partly due to the fact that the authors used 2 omics (SNPs and transcripts) to predict phenotypes, and sometimes, in the results, it is not clear which of the 2 is the focus. This was especially the case for the importance of the features and the interpretability of the models, where I found it sometimes hard to tell whether the analysis was done on SNPs or transcripts.

      Also, regarding the methodology, I did not understand why the authors needed to perform a feature selection approach. Maybe it was required to perform the interaction analysis, which could not be deployed on all the features? But regarding the importance of the features, I do not get the added value of the selection over the direct use of SHAP indices when using all features. Maybe this is because I am not a specialist in this kind of approach, but maybe the authors could add more details to explain the rationale behind the feature selection.

    3. Reviewer #2 (Public review):

      Summary:

      The authors aimed to evaluate whether integrating genomic (SNP) and transcriptomic information with machine learning can improve phenotypic prediction of polygenic traits across environments. The manuscript explored not only the predictability across models and predictor feature sets, but also attempted to identify meaningful genes and interactions underlying trait variation.

      Strengths:

      The main strength of the manuscript is its integration of SNP, transcriptomic, and phenotype datasets for 426 switchgrass genotypes grown in Texas and Michigan. It provides a systematic comparison of predictor types (SNP versus transcript abundance) and of model strategies to integrate them.

      Weaknesses:

      (1) Experimental Design

      The experimental design raises several concerns that should be clarified before strong biological conclusions are drawn from the transcriptomic analyses.

      First, the transcriptomic sampling is not well aligned with the developmental stages most relevant to the phenotypes being modeled. Leaf tissue was collected at a single time point in each environment, whereas traits such as flowering time, biomass, tiller count, and panicle height arise from developmental processes occurring over extended and potentially distinct temporal windows. Consequently, the measured expression profiles are likely to reflect physiological states specific to the sampling dates (May 5-6 in Texas and June 22-24 in Michigan) rather than the regulatory processes underlying the target phenotypes.

      Second, the phrase "haphazardly randomized" is questionable for a field experiment. It is unclear whether the design included formal randomization, blocking, row/column structure, or spatial correction. Without explicit accounting for spatial field heterogeneity, environmental variation within sites may confound genotype and transcriptomic effects.

      Third, the Methods do not clearly describe biological replication for RNA-seq. If each genotype-by-environment combination were represented by a single transcriptomic sample, then within-genotype expression variance cannot be estimated. This is important because transcript abundance is highly sensitive to microenvironment, sampling time, tissue status, developmental stage, and technical variation. The absence of replication significantly weakens confidence in gene-level feature importance and gene-gene interaction claims.

      Fourth, the analysis of expression differences across environments is based on a simple subtraction (TX - MI) followed by correlation with genetic similarity. This approach is not standard in transcriptomic analysis and does not account for variability, replication, or statistical uncertainty. Conventional methods for assessing differential expression and genotype-by-environment interactions rely on model-based frameworks that explicitly estimate variance components and test for interaction effects. Without such modeling, the observed expression differences may reflect noise or confounding factors rather than genotype-driven responses.
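
      The concern can be made explicit with a standard two-way decomposition. The notation below is introduced here for illustration and is not taken from the manuscript. It assumes additive genotype, environment, and interaction terms, with residual noise that is independent across environments and independent of the interaction term.

```latex
\begin{align}
y_{g,e} &= \mu + G_g + E_e + (GE)_{g,e} + \varepsilon_{g,e},
  \qquad \varepsilon_{g,e} \sim \mathcal{N}(0, \sigma^2) \\
\Delta_g &= y_{g,\mathrm{TX}} - y_{g,\mathrm{MI}}
  = (E_{\mathrm{TX}} - E_{\mathrm{MI}})
  + \bigl[(GE)_{g,\mathrm{TX}} - (GE)_{g,\mathrm{MI}}\bigr]
  + (\varepsilon_{g,\mathrm{TX}} - \varepsilon_{g,\mathrm{MI}}) \\
\operatorname{Var}_g(\Delta_g) &=
  \operatorname{Var}_g\bigl[(GE)_{g,\mathrm{TX}} - (GE)_{g,\mathrm{MI}}\bigr]
  + 2\sigma^2
\end{align}
```

      With a single sample per genotype and environment, the interaction variance and the noise term 2σ² enter the variance of the difference together and cannot be separated, which is why replication or a model-based estimate of σ² is needed.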

      (2) SHAP contribution values

      Although SHAP is a well-established framework for decomposing model predictions into feature-level contributions, its use in this manuscript raises several concerns regarding interpretation, statistical validity, and biological inference.

      First, SHAP values quantify the contribution of features within the fitted model, conditional on the joint distribution of inputs and the model structure. They do not represent causal effects or direct biological importance. There is also a difference in scale: SHAP values are often expressed in log-odds, whereas the regression model uses absolute units. Without a fair evaluation of model fit, SHAP values need to be interpreted cautiously, because a feature can show very high SHAP values even when the model fits poorly.

      In genomic data, where features are highly correlated due to linkage disequilibrium and co-expression, SHAP values can distribute contribution values across correlated variables in ways that are not uniquely identifiable. As a result, features highlighted as "important" may reflect correlation structure rather than true functional relevance.
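
      A toy example illustrates the non-identifiability. It is a sketch under stated assumptions, not the authors' XGBoost pipeline. It uses the closed-form SHAP value for a linear model with features treated as independent, w_i (x_i - mean(x_i)), and two hypothetical markers in perfect linkage disequilibrium. The two models make identical predictions on these data but attribute them differently.

```python
def linear_shap(weights, x, means):
    """Closed-form SHAP values for a linear model, treating features as independent."""
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


def predict(weights, x):
    """Prediction of a linear model without intercept."""
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical data: feature 2 is an exact copy of feature 1 (perfect LD).
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
means = [sum(col) / len(col) for col in zip(*data)]

model_a = [2.0, 0.0]  # puts all weight on feature 1
model_b = [1.0, 1.0]  # splits the weight across both features

sample = data[3]
phi_a = linear_shap(model_a, sample, means)
phi_b = linear_shap(model_b, sample, means)

print("predictions:", predict(model_a, sample), predict(model_b, sample))
print("SHAP, model A:", phi_a)
print("SHAP, model B:", phi_b)
```

      Both attributions sum to the same total, so the data cannot say which split is correct. Which feature appears "important" depends on how the fitting procedure happens to distribute weight among correlated predictors.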

      This correlation structure can be exacerbated in this manuscript by the use of TPM-normalized transcript abundances as predictor variables without biological replicates. Even assuming the estimates of transcript abundance are robust, TPM values are compositional: the constant-sum constraint creates dependencies among all genes and induces negative correlations. This issue is particularly relevant for the interpretation of gene importance and interaction effects, where correlated predictors can lead to unstable and non-unique attributions. The biological interpretation of transcript-based features therefore remains uncertain.
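
      The effect of the constant-sum constraint can be seen in a small simulation. This is a sketch with simulated values only. The number of genes is kept deliberately small (five) so that the effect is visible, the abundances are drawn independently, and all names are hypothetical.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_pairwise_corr(rows):
    """Average Pearson correlation over all pairs of columns (genes)."""
    cols = list(zip(*rows))
    pairs = list(combinations(range(len(cols)), 2))
    return sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)


def to_tpm_like(row, total=1e6):
    """Rescale one sample so that its values sum to a constant, as TPM does."""
    s = sum(row)
    return [v / s * total for v in row]


rng = random.Random(0)
n_samples, n_genes = 2000, 5

# Independent abundances: no true correlation between genes.
raw = [[rng.lognormvariate(0.0, 0.5) for _ in range(n_genes)]
       for _ in range(n_samples)]
closed = [to_tpm_like(r) for r in raw]

raw_corr = mean_pairwise_corr(raw)
closed_corr = mean_pairwise_corr(closed)
print(f"mean pairwise correlation, raw abundances: {raw_corr:+.3f}")
print(f"mean pairwise correlation, after closure:  {closed_corr:+.3f}")
```

      With thousands of genes the induced correlation per gene pair is much smaller, but it does not vanish and is strongest among highly expressed genes, which is where feature-importance rankings tend to concentrate.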

      (3) Result interpretation

      For example, the statement on page 11 that "plasticity SNP- and transcriptomic-based models generally outperformed single-environment models for traits with low cross-environment correlation, such as green-up (Fig. 2c, r = -0.13, p < 8.3 × 10⁻³) and tiller count (Fig. 2f, r = -0.08, p = 0.1) (Supplementary Fig. S1)." is too broad. For green-up, the Diff model appears much better than MI, but not clearly better than TX.

      Also on page 11, the statement that "...Diffexp was more predictive than SNPs for trait plasticity in biomass, flowering time, and tiller count..." holds true only for biomass, not for flowering time or tiller count.

      The claim of "complementary information" between SNP and transcriptomic models on page 12 is stronger than what is supported by Figure 2. Figure 2 shows different predictive performance, but it does not by itself demonstrate complementarity. Establishing complementarity requires evidence that combining SNP+T improves prediction consistently or captures distinct, non-overlapping signals. Yet the preceding section says SNP+T outperformed either single data type in only 15% of cases, with modest gains, which is confusing. Also, the label G+T does not appear in Figure 2; the figure uses SNP+T.

    1. eLife Assessment

      This Review Article provides an overview of circadian findings obtained using the zebrafish model and will be of particular interest to researchers working with zebrafish in chronobiology and behavioural neuroscience. The article would benefit from a broader conceptual framework that more clearly positions zebrafish within the wider landscape of animal models used in circadian biology, including comparisons with other extensively studied systems. In addition, several citation inaccuracies and interpretational issues identified during peer review should be carefully addressed to strengthen the accuracy and impact of the review.

    2. Reviewer #1 (Public review):

      Summary:

      Wang Liao and colleagues aim to provide a comprehensive synthesis of zebrafish circadian research, with particular emphasis on the decentralized photoreceptive architecture that distinguishes teleosts from mammals, and to outline future research directions leveraging emerging technologies for translational applications. The authors frame zebrafish as occupying a "crucial evolutionary and experimental niche" and argue that the model system is uniquely suited to address open questions in chronobiology.

      Strengths:

      The review is broad in scope and up to date in its citation of recent primary literature. The coverage of physiological outputs - spanning cardiovascular rhythmicity, hepatic metabolism, immune function, reproduction, and gut homeostasis - is more comprehensive than many existing reviews in this area, and researchers seeking an entry point into any of these subfields will find a useful orientation. The figures are well-designed and effectively summarise complex regulatory relationships. The section on immune rhythmicity is a particular strength, providing mechanistic detail on how specific clock components (Clock1a, Per1b, Per2, Cry1a) differentially regulate neutrophil behaviour, bacterial killing, and cytokine expression; this level of molecular specificity distinguishes it from comparable sections in the review. The brief discussion of non-canonical clock gene functions (CLOCK in neuronal connectivity, BMAL1 in stem cell state, vascular calcification) raises genuinely interesting points that are underexplored in the field and might deserve more prominence.

      The future perspectives section makes a conceptually interesting move in suggesting that the zebrafish decentralized architecture could reframe a central question in chronobiology - from how a master clock imposes order on passive peripheral oscillators, to how semi-autonomous oscillators achieve coherence. This is the most original conceptual contribution in the manuscript, and it would benefit from much further development.

      Weaknesses:

      The core limitation of this review is that it functions primarily as an annotated bibliography rather than a critical synthesis. Section after section follows the same pattern: a physiological system is introduced, several findings from recent papers are described in sequence, and the section ends. Missing throughout is an evaluative voice - where does the field agree, where does it disagree, which findings have been replicated versus remain preliminary, and which conceptual questions are genuinely unresolved versus merely unstudied? Readers with expertise in the field will find little that reframes their understanding; readers new to the field will receive information but not the interpretive scaffolding needed to assess its significance.

      The framing of zebrafish as occupying a "crucial evolutionary and experimental niche" is asserted but not substantiated. The experimental advantages of zebrafish - optical transparency, external development, genetic tractability - are real, but they apply primarily to larval stages, typically the first two weeks of development. The review does not adequately address whether the key features it highlights, particularly peripheral photosensitivity and autonomous peripheral oscillators, have been demonstrated in adult animals, where optical transparency is lost. Many of the physiological findings described (sleep-wake cycles, cardiovascular function, reproduction, and immune function) are most relevant in adult or juvenile fish, yet the mechanistic underpinnings often come from larval studies. Whether the mechanisms generalise across developmental stages is not discussed, and this is an important gap that the review could acknowledge explicitly.

      The claim that zebrafish bridge invertebrate and mammalian models is a conventional framing that appears in most zebrafish review articles; its repetition here adds little. More interesting - and underexplored - is the comparative question of how the decentralised clock architecture of teleosts compares with that of other non-mammalian vertebrates, or indeed with invertebrate systems such as Drosophila, where peripheral tissue clocks and non-visual photoreception have also been studied. The review does not engage with this comparative dimension, which would be the natural intellectual context for the claims being made.

      The future perspectives section identifies several promising directions - optogenetic circuit mapping, whole-body longitudinal imaging, inter-organ communication, network modeling - but these are described at a high level of generality. Most are not specific to the questions raised by the zebrafish decentralized clock architecture; they would appear in any forward-looking review of circadian biology. The one conceptually distinctive idea - that zebrafish could be used to ask how distributed oscillators achieve coordinated coherence without hierarchical control - is identified but not developed into concrete experimental questions or testable predictions. The discussion of non-canonical clock gene functions in the Future Perspectives section would benefit from being more directly connected to what zebrafish specifically can offer: given that teleost genome duplication has produced additional paralogues of clock genes, there is a concrete opportunity to dissect canonical from non-canonical functions through comparative analysis of paralogues with diverged expression patterns. This point is hinted at but not made explicitly.

      Appraisal of conclusions:

      The conclusions are broadly consistent with the evidence cited, and the authors are appropriately cautious in noting that many signalling cascades and inter-tissue communication mechanisms remain incompletely characterised. The conclusion that zebrafish represents a valuable and underexploited model for circadian-disease translational research is well-supported. However, the review would be significantly strengthened if the authors distinguished more clearly between what is firmly established, what is supported by preliminary or single-study evidence, and what remains genuinely speculative.

      Likely impact and utility:

      This review will be useful as an orientation document for researchers new to zebrafish circadian biology, and the comprehensive treatment of physiological outputs across organ systems is a genuine service to the field. Its impact as an intellectual contribution is limited by the descriptive approach and the absence of original synthesis or conceptual reframing. The most interesting ideas in the manuscript - the reframing of the central/peripheral clock hierarchy question, and the potential of clock gene paralogues for probing non-canonical functions - could be further developed and, if pursued, could form the basis of a more distinctive and impactful contribution.

    3. Reviewer #2 (Public review):

      Summary:

      This review is valuable in principle because circadian rhythms in zebrafish are unexplored. There are a number of significant weaknesses that should be addressed for it to have an impact. First, while the review covers a broad range of topics in chronobiology, it does not put them in context. Placing zebrafish work in the context of other model organisms that are better understood and other fish species would broaden the appeal. The review could also expand to a discussion of sleep, where the understanding in zebrafish is much more advanced. Critically, providing a novel framework, identifying new areas of opportunity and limitations of the system would expand the interest to non-zebrafish research groups. In addition, there are a number of misstatements/mis-citations that are critical to correct. Therefore, I find this review potentially impactful, but its current form is likely to limit its impact.

      Strengths:

      Focusing on decentralized photo sensing is a strength because it is relatively unique to zebrafish.

      The breadth of discussion in zebrafish is a strength.

      Weaknesses:

      It might be helpful to reorganize the review with an introduction on what is known in other better studied systems to be highly conserved, then to focus in on the components of zebrafish that are discussed here.

      A weakness is the lack of integration with other model organisms and other fish systems. Therefore, the narrow focus on zebrafish is unlikely to appeal to broader audiences.

      It's surprising that there is not more discussion of sleep, which has been studied in detail, and its relationship to the clock.

      Discussions of limitations of the model, including adult vs larval analysis and challenges performing long-term behavioral analysis in fish, would be valuable.

    4. Reviewer #3 (Public review):

      Summary:

      Over the past 3 or 4 decades, our understanding of the molecular mechanism underlying the circadian clock has increased substantially. This is in large part due to successful forward and reverse genetics approaches applied to a broad range of genetic model systems, notably Drosophila, Neurospora, mouse, Arabidopsis and cyanobacteria. Although the clock components in these species are diverse, the basic operating principles are highly conserved, allowing us to build a general view of clock mechanisms. Looking forward, there are still many unanswered questions regarding how clocks are organized at the systems level and, in turn, how they are coupled to key aspects of physiology. Each model species has its own set of advantages and disadvantages for tackling particular questions. As this timely review aims to illustrate, the zebrafish has become a particularly valuable model for exploring circadian clock biology. This is in part due to its technical advantages, accessibility of early developmental stages and its directly light-entrainable peripheral clocks. This provides unparalleled opportunities for studying the circadian clock hierarchy and its links with physiology.

      Strengths:

      This review does a good job of integrating the many lines of circadian clock research where the zebrafish has been used as a model and provides an overview of many future challenges it is well-suited to tackle.

      Weaknesses:

      There are citation errors, as well as inaccurate and misleading statements that must be remedied in a revised version.

    5. Author response:

      We sincerely thank the reviewers and editors for the thorough, constructive, and insightful comments, which have greatly helped us improve the accuracy, clarity, and rigor of the manuscript. We acknowledge that the current version has several limitations, including insufficient contextualization with other model systems and lack of critical synthesis. These important weaknesses will be comprehensively addressed in a future revised version of the review.

      For the present revision, we have focused exclusively on correcting objective errors, factual inaccuracies, and citation mistakes as pointed out by the reviewers. All specific factual and reference issues raised by Reviewer 2 and Reviewer 3 have been carefully corrected in the revised manuscript, including inaccurate statements, incorrect citations, missing references, and inconsistent descriptions of zebrafish clock genes, photoreception, and physiological functions.

      We appreciate the reviewers’ thoughtful suggestions regarding the conceptual depth, comparative context, critical synthesis, and expanded discussion of sleep and model limitations. While we fully agree that these aspects would significantly strengthen the review, we plan to systematically incorporate these broader conceptual improvements in a future, more substantial revision.

    1. eLife Assessment

      This important study demonstrates that a perinuclear actomyosin network, present in some types of human cells, facilitates kinetochore-spindle attachment of chromosomes in unfavorable locations, thereby reducing their missegregation rate. This actomyosin network and its general role have been studied previously, but this study convincingly clarifies the underlying mechanism and expands the investigation to additional cell lines. The results are relevant to understanding chromosome missegregation in cancer cells.

      [Editors' note: this paper was reviewed by Review Commons.]

    2. Reviewer #1 (Public review):

      Sheidaei and colleagues report a novel and potentially important role for an early mitotic actomyosin-based mechanism, PANEM contraction, in promoting timely congression of chromosomes located at the nuclear periphery, particularly those in polar positions. The manuscript will interest researchers studying cell division, cytoskeletal dynamics, and motor proteins. Although some data overlap with the group's prior work, the authors extend those findings by optimizing key perturbations and performing more detailed analyses of chromosome movements, which together provide a clearer mechanistic explanation. The study also builds naturally on recent ideas from other groups about how chromosome positioning influences both early and later mitotic movements.

      Comments on revised version:

      In the revised manuscript, organizational issues have been largely resolved. In addition, the inclusion of new experiments in additional cell lines, along with an expanded discussion that places actomyosin contractility in the broader conceptual context of other mechanisms governing chromosome movement, has significantly strengthened the manuscript.

    3. Reviewer #3 (Public review):

      Sheidaei et al. report how chromosomes are favourably positioned to facilitate kinetochore-microtubule interactions during early mitosis. Studying kinetochore capture during early prophase is extremely difficult due to kinetochore crowding, but the team has taken up the challenge by classifying types of kinetochore movements, carefully marking kinetochore positions in early mitosis, and linking these to map their fate/next positions over time. The work is an excellent addition to the chromosome segregation field, as most of the literature has thus far focused on tracking kinetochores at slightly later stages of mitosis. The authors show that PANEM facilitates chromosome positioning toward the interior of the newly forming spindle, which in turn promotes chromosome congression. In the absence of PANEM, chromosomes end up in unfavourable locations and fail to form proper kinetochore-microtubule interactions. The work highlights the perinuclear actomyosin network in early mitosis (PANEM) as a key spatial and temporal element of chromosome congression, a step that precedes the segregation process.

      Comments on revised version:

      The authors' revisions have brought clarity to the description of movements in many of the figures. The manuscript ties a fundamental process to differences in cancer cell lines.

      The work extends their published discovery that an actomyosin network forms on the cytoplasmic side of the nuclear envelope during prophase. The current manuscript explains how this network facilitates chromosome capture and congression by tracking the motions of individual kinetochores during early mitosis. The findings are broadly useful for the cell division and cytoskeletal fields.

    4. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary

      Sheidaei and colleagues report a novel and potentially important role for an early mitotic actomyosin-based mechanism, PANEM contraction, in promoting timely congression of chromosomes located at the nuclear periphery, particularly those in polar positions. The manuscript will interest researchers studying cell division, cytoskeletal dynamics, and motor proteins. Although some data overlap with the group's prior work, the authors extend those findings by optimizing key perturbations and performing more detailed analyses of chromosome movements, which together provide a clearer mechanistic explanation. The study also builds naturally on recent ideas from other groups about how chromosome positioning influences both early and later mitotic movements.

      In its current form, however, the manuscript is not acceptable for publication. It suffers from major organizational problems, an overcrowded and confusing Results section and figures, and a lack of essential experimental controls and contextual discussion. These deficiencies make it difficult to evaluate the data and the authors' conclusions. A substantial structural revision is required to improve clarity and persuasiveness. In addition, several key control experiments and more conceptual context are needed to establish the specificity and relevance of PANEM relative to other microtubule- and actin-based mitotic mechanisms. Testing PANEM in additional cell lines or contexts would also strengthen the claim. I therefore recommend Major Revision, addressing the structural, conceptual, and experimental issues detailed below.

      Major Comments

      A. Structural overhaul and figure reorganization

      The Results section is overly dense, lacks clear structure, and includes descriptive content that belongs in the Methods. Many figure panels should be moved to Supplementary Materials. A substantial reorganization is required to transform the manuscript into a focused, "Reports"-type article.

      Move methodological and descriptive details (e.g., especially from the second Results subheading and Figure 2) to the Methods or Supplementary Materials.

      In these parts, we define four phases of kinetochore motion in early mitosis. Without such a description in the main text, readers would be confused about subsequent analyses. Figure 2 is also important to show examples of how the four phases develop. Although we respect this suggestion from the reviewer, we would like to keep these parts in the main text and main figure.

      Remove repetitive statements that simply restate that later phenotypes arise as consequences of delayed Phase 1 (applicable to subheadings 3 onward).

      As suggested, we have removed the statement for the delayed start of Phase 2 for peripheral kinetochores in azBB-treated cells (Page 9, second paragraph). We have also simplified the statement for the delayed start of Phase 3 and Phase 4 to avoid repetition (Page 9, third paragraph; Page 10, second paragraph).

      Figure 4I: This panel is currently unclear and should be drastically simplified.

      Following this suggestion, we simplified Figure 4I by removing the column of ‘Start’, which is easily deduced from the ‘Duration’ results and therefore does not provide much new information.

      I recommend reorganizing the figures as follows:

      Figure 1: Keep as a single figure but simplify. Figures 1D and 1E could be combined; move unnormalized CSV to supplementary materials. The same goes for 1F.

      We have reorganized Figure 1, as suggested, and moved unnormalized data to supplemental materials.

      New Figure 2: Combine current Figures 2A, 3A, 3C, 3D, 4C, 4F, and 4H to illustrate how PANEM contraction facilitates initial interactions of peripheral chromosomes with spindle microtubules which increases speed of congression initiation.

      If we were to follow this suggestion, we would lose Figure 2B, D, Figure 3B and Figure 4A, where examples of kinetochore motions are shown in images and 3D diagrams. The new Figure would mostly consist of only graphs. Without examples of images and 3D diagrams, readers would have difficulty understanding the study. Although we respect this suggestion from the reviewer, we would like to keep Figures 2, 3 and 4, as they are (except for making Figure 4I simpler; see above).

      New Figure 3: Combine current Figures 5A, 5C, 5D, 5F, 6B, 6C, and lower panels of 4H to show how PANEM contraction repositions polar chromosomes and reduces chromosome volume in early mitosis to enable rapid initiation of congression.

      If we were to follow this suggestion, we would lose Figure 5B and Figure 6A, where examples of kinetochore/chromosome dynamics are shown in images and 3D diagrams. For the same reason as above, we would like to keep Figure 5 and 6 as they are, although we respect this suggestion from the reviewer.

      New Figure 4: Combine Figures 7A, 7B, 7D, 7E, 7F, expanded Supplementary Figure S7, and new data to demonstrate that PANEM actively pushes peripheral chromosomes inward which is important for efficient chromosome congression in diverse cellular contexts.

      We have conducted new experiments to demonstrate the role of PANEM in diverse cellular contexts, as detailed below. We have combined the new results with the original Figure S7 to create Figure 8 in line with this suggestion.

      On the other hand, in our view, combining Figure 7A-E and the extended Figure S7 would be confusing because the two parts address different topics. Although we respect this suggestion from the reviewer, we would like to keep Figure 7 and the extended Figure S7 (i.e. Figure 8) separate.

      B. Specificity and redundancy of actin perturbation

      To establish the specificity and relevance of PANEM, the authors should include or discuss appropriate controls:

      Apply global actin inhibitors (e.g., cytochalasin D, latrunculin A) to disrupt the entire actin cytoskeleton. These perturbations strongly affect mitotic rounding and cytokinesis but only modestly influence early chromosome movements, as reported previously (Lancaster et al., 2013; Dewey et al., 2017; Koprivec et al., 2025). The minimal effect of global inhibition must be addressed when proposing a localized actomyosin mechanism. Comment on whether the apparent differences between this approach and the one the authors used arise from different cell types.

      We did experiments along this line, using a dominant-negative LINC construct, in our previous study (Booth et al eLife 2019). LINC-DN should more specifically remove/reduce PANEM than the global actin inhibitors mentioned above. LINC-DN attenuated the reduction of CSV soon after NEBD and increased the number of polar chromosomes (Booth et al eLife 2019); i.e. in this regard, the outcome was similar to azBB treatment in the current study. One can expect that global actin inhibitors would also inhibit the PANEM formation and show effects similar to LINC-DN. By contrast, the indicated references reported that global actin inhibitors strongly affect mitotic rounding and cytokinesis but only modestly influence early chromosome movements, as the reviewer noted. One possibility is that such differences may have arisen from different cell types – this could be important, especially given that some cells form the PANEM and others do not (Figure 8A). A second possibility is that cytokinesis, mitotic rounding and PANEM formation may rely on actin polymerization to different extents. For example, the same concentration of global actin polymerization inhibitors may affect cytokinesis, but may still allow PANEM formation to proceed without observable effects on early chromosome movements. As suggested, we discussed this topic in the Discussion (page 16, third paragraph).

      Clarify why spindle-associated actin, especially near centrosomes, as reported in prior studies using human cultured cells (Kita et al., 2019; Plessner et al., 2019; Aquino-Perez et al., 2024), was not observed in this study. Myosin-10 and actin were also observed close to centrosomes during mitosis in X. laevis mitotic spindles (Woolner et al., 2008). Possible explanations include differences in fixation, probe selection, imaging methods, or cell type. Note that some actin probes (e.g., phalloidin) poorly penetrate internal actin, and certain antibodies require harsh extraction protocols. Comment on the possibility that interference with a pool of Myo10 at the centrosomes is important for the effects on congression.

      As the reviewer implies, we cannot rule out that we could not detect actin associated with the spindle or centrosomes because of the difference in methods or cell lines between the current study and the literature mentioned by the reviewer. We have therefore moderated our claim in the Discussion that ‘we did not detect any actin network inside the nucleus, on the spindle or between chromosomes’ by adding ‘at least, using the method and the cell line in the current study’ to this statement (Page 14, second paragraph). We have also cited the three references mentioned by the reviewer in the Discussion (Page 14, second paragraph). Regarding Myosin-10, azBB (blebbistatin variant) should have negligible effects on class-X myosin, including Myosin-10 (Limouze et al 2004 [PMID 15548862]). It is therefore unlikely that the effects of azBB that we observed in the current study are due to the inhibition of Myosin-10. We have cited Woolner et al 2008 and another paper and discussed this topic in the Discussion (Page 14, second paragraph).

      C. Expansion of PANEM functional analysis

      To strengthen the conclusions and broaden the study beyond the group's previous work, PANEM function should be tested in additional contexts (some may be considered optional but important for broader impact): [underlined by authors]

      Test PANEM function in at least one additional cell line that displays PANEM to rule out cell-line-specific effects.

      As suggested, we have studied the effect of PANEM contraction in cell lines other than U2OS. We have found that when PANEM contraction was inhibited, the reduction in chromosome scattering was diminished in RPE1 cells (new Figure 8B, C). Moreover, we have found that inhibition of PANEM contraction increased polar chromosomes during prometaphase/metaphase in RPE1 and HCT116 cells (which form PANEM), but not in HeLa cells (which do not form PANEM) (new Figure 8D, E). These results suggest that the effects of PANEM contraction, originally observed in U2OS cells, are also present in other cell lines (RPE1 and HCT116) that form PANEM.

      Examine higher-ploidy or binucleated cells to determine whether multiple PANEM contractions are coordinated and if PANEM contraction contributes more in cells of higher ploidies or specific nuclear morphologies.

      This is an interesting suggestion, but it takes lots of time to conduct such a study, and it goes beyond the scope of this paper.

      Investigate dependency on nuclear shape or lamina stiffness; test whether PANEM force transmission requires a rigid nuclear remnant.

      This is an interesting suggestion, but it takes lots of time to conduct such a study, and it goes beyond the scope of this paper.

      Analyze PANEM's contribution under mild microtubule perturbations that are known to induce congression problems (e.g., low-dose nocodazole).

      In the current study, we found that PANEM contraction affects chromosome motions in Phase 1 and Phase 3 but not Phase 2 or Phase 4. Mild microtubule perturbation itself could affect chromosome motions in all four Phases. We do not think it would be so informative to study what additional effects the reduced PANEM contraction shows when combined with mild microtubule perturbation.

      Evaluate PANEM contraction role in unsynchronized U2OS cells, where centrosome separation can occur before NEBD in a subset of cells (Koprivec et al., 2025), and in other cell types with variable spindle elongation timing.

      Following this suggestion, we first investigated the timing of spindle elongation, relative to NEBD, in asynchronous U2OS cells (Figure 8 – figure supplement 3). We imaged cells every 5 min (it was difficult to reasonably observe enough mitotic cells using a shorter interval). Most of the cells showed no significant change in the spindle length (distance between two spindle poles) after (or around) NEBD [e.g. Cell 1 in A] or a mild reduction in it [e.g. Cell 2 in A]. Only a small number of cells (2-3 out of 26) showed a mild increase in the spindle length after (or around) NEBD [e.g. Cell 3 in A]. Because the spindle elongation after NEBD was rare and mild, it was difficult to address how the timing of spindle elongation affects the effect of PANEM on reducing chromosome scattering and on chromosome relocation from polar regions. We explained this result and discussed this topic in the Discussion section.

      Quantify not only the percentage of affected cells after azBB but also the number of chromosomes per cell with congression defects in the current and future experiments.

      It is tricky to count the number of chromosomes because they frequently overlap. Counting kinetochores is more feasible, but kinetochore signals show some non-specific background (e.g. those outside of the nucleus in prophase). We therefore quantified the chromosome volume at polar regions in azBB-treated cells (Figure 6C).

      D. Conceptual integration in Introduction and Discussion

      The manuscript should better situate its findings within the context of early mitotic chromosome movements:

      Clearly state in the Introduction and elaborate in the Discussion that initiation of congression is coupled to biorientation (Vukušić & Tolić, 2025). This provides essential context for how PANEM-mediated nuclear volume reduction supports efficient congression of polar chromosomes.

      It has been a widely accepted view in the field that chromosome congression precedes biorientation, since the publication in 2006 (Kapoor et al Science 2006). Very recently, this view has been challenged by a new publication (Vukušić & Tolić, Nat Commun 2025), as indicated by this reviewer. We have mentioned this new model and discussed the new interpretation of our results based on this new model, in the Discussion (page 15; ‘It has been a widely accepted view…’).

      To explain the new interpretation of our results more clearly, we have a new diagram as a supplemental figure (Figure 9 – figure supplement 1) in the revised manuscript.

      Explain that PANEM is most critical for polar chromosomes because their peripheral positions are unfavorable for rapid biorientation (Barišić et al., 2014; Vukušić & Tolić, 2025).

      We have included such a statement in the Discussion, as a part of the new interpretation of our results based on the new model that chromosome biorientation precedes congression (see above). We have also cited the indicated two papers.

      Discuss how cell lines lacking PANEM (e.g., HeLa and others) nonetheless achieve efficient congression, and what alternative mechanisms compensate in the absence of PANEM. For example, it is well established that cells congress chromosomes after monastrol or nocodazole washout, which essentially bypasses the contribution of PANEM contraction.

      Following this suggestion, we discussed three possible mechanisms that could compensate for a lack of PANEM and facilitate kinetochore-MT interaction and chromosome congression, based on previous literature (Page 17): 1) the enhanced assembly rate of spindle MTs may facilitate kinetochore-MT interactions in N-CIN+ cancer cells, 2) chromosome biorientation may precede congression more frequently to promote the congression towards the spindle midplane, and 3) the balance among the activities of CENP-E, Dynein and chromokinesins may tilt towards greater chromosome-arm ejection forces towards the spindle midplane.

      Minor Comments

      These issues are more easily addressable but will significantly improve clarity and presentation.

      Introduction

      Remove the reference to Figure 1A in the Introduction. The portion of Figure 1 and related text that recapitulates the authors' previous work should be incorporated into the Introduction, not the Results.

      As suggested in the second sentence of this comment, we have moved most of the second paragraph of the first section of Results to Introduction (Page 4) and cited Figure 1A and 1B in Introduction. We would like to keep the reference to Figure 1A in the Introduction, because showing the PANEM images at the beginning of the manuscript would help readers’ understanding of our study. In addition, citing Figure 1A in the Introduction is more consistent with the suggestion in the second sentence of this comment.

      Results (by subheading)

      First subheading: When introducing the ~8-minute early mitotic interval, cite additional studies that have characterized this period: Magidson et al., 2011 (Cell); Renda et al., 2022 (Cell Reports); Koprivec et al., 2025 (bioRxiv); Vukušić & Tolić, 2025 (Nat Commun); Barišić et al., 2013 (Nat Cell Biol).

      As suggested, we cited these references at the indicated part of the first section of the Results (page 5).

      Second subheading: Cite key reviews and foundational research on kinetochore architecture and sequential chromosome movement during early mitosis: Musacchio & Desai, 2017 (Biology); Itoh et al., 2018 (Sci Rep); Magidson et al., 2011 (Cell); Vukušić & Tolić, 2025 (Nat Commun); Koprivec et al., 2025 (bioRxiv); Rieder & Alexander, 1990 (J Cell Biol); Skibbens et al., 1993 (J Cell Biol); Kapoor et al., 2006 (Science); Armond et al., 2015 (PLoS Comput Biol); Jaqaman et al., 2010 (J Cell Biol).

      Rieder & Alexander, 1990 (J Cell Biol) and Kapoor et al., 2006 (Science) have already been cited in the second section of the Results in the original manuscript. We agree that all other references should be cited in this manuscript, and they are now cited in the Introduction and/or Discussion where they fit best (e.g. Musacchio & Desai 2017 reviews the kinetochore in general and is therefore best cited in the Introduction).

      Third subheading: Clarify why some kinetochores on Figure 3A appear outside the white boundaries if these boundaries are intended to represent the nuclear envelope.

      We interpret that these are background signals in the cytoplasm, which do not come from kinetochores, because 1) before NEBD, they were outside of the nucleus, and 2) after NEBD, they did not show any characteristic kinetochore motions such as those towards a spindle pole (Phase 2) and the spindle mid-plane (Phase 4). We have commented on these background signals in the legend for Figure 3A.

      Fourth subheading: Note that congression speed is lower for centrally located kinetochores because they achieve biorientation more rapidly (Barišić et al., 2013, Nat Cell Biol; Vukušić & Tolić, 2025, Nat Commun).

      Relevant to this comment, there was an error regarding the congression speed of central kinetochores (original Figure 4H). The congression speed of peripheral kinetochores was shown correctly, but that of central kinetochores was mistakenly shown in µm per time interval (30 s) rather than µm per minute. We have corrected this error in the revised manuscript (new Figure 4H). Based on the corrected data, the speed of congression is similar between peripheral and central kinetochores. The original Figure 3G (the speed of poleward motion for central kinetochores) had a similar error, which we have also corrected in the revised manuscript. We apologize for these errors and the confusion they may have caused.
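
      The unit correction described above amounts to a simple rescaling. The following is a minimal sketch, assuming a 30 s imaging interval as stated in the response; the function name and the example displacement are hypothetical and are not taken from the manuscript.

```python
def per_interval_to_per_minute(displacement_um: float, interval_s: float = 30.0) -> float:
    """Convert a displacement per imaging interval (µm) into a speed in µm/min.

    interval_s is the time between frames in seconds (30 s in the response above).
    """
    return displacement_um * (60.0 / interval_s)


# Hypothetical value: 0.5 µm travelled per 30 s frame corresponds to 1.0 µm/min.
speed = per_interval_to_per_minute(0.5)
print(speed)
```

      With a 30 s interval, values reported per interval would therefore understate the per-minute speed by a factor of two.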

      Regarding this comment, if biorientation is achieved more rapidly for central kinetochores, Phase 3 (rather than congression speed) would be shorter for central kinetochores. Indeed, Phase 3 is slightly shorter for central kinetochores (control) than for peripheral kinetochores (control) (Figure 4C), but the difference is not statistically significant (t test; p = 0.21).

      Fifth subheading: Cite studies on polar chromosome movements: Klaasen et al., 2022 (Nature); Koprivec et al., 2025 (bioRxiv). Clarify that Figure 5F displays only those kinetochores that initiated directed congression movements.

      These two references have already been cited and discussed in this Results section of our original manuscript. However, considering this suggestion, we have discussed in more detail the polar chromosome movements reported by Koprivec et al. (page 11). Meanwhile, the reviewer is correct about Figure 5F, and we have clarified this point in the Figure 5F legend.

      Sixth subheading (currently in Discussion): Move the final paragraph of the Discussion into the Results and expand it with preliminary analyses linking PANEM contraction to congression efficiency across untreated cell types or under mild nocodazole treatment.

      As suggested, we have moved the final paragraph of the Discussion in the original manuscript to make a new final section in the Results in the revised manuscript. Moreover, as suggested, we have studied the outcome of inhibiting PANEM contraction in cell lines other than U2OS (Figure 8 B–E), and have described the new results in the new final section of the Results.

      Discussion

      1. When discussing cortical actin, cite key reviews on its presence and function during mitosis: Kunda & Baum, 2009 (Trends Cell Biol); Pollard & O'Shaughnessy, 2019 (Annu Rev Biochem); Di Pietro et al., 2016 (EMBO Rep).

      As suggested, we have cited all these review papers in the Discussion (page 17), and mentioned the role of the cortical actin on the spindle orientation and positioning (Kunda & Baum, 2009; Di Pietro et al., 2016), as well as the function of the actomyosin ring on cytokinesis (Pollard & O'Shaughnessy, 2019).

      Significance

      Advance

      This study's main strength is its novel and potentially important demonstration that contraction of PANEM, a peripheral actomyosin network that operates in early mitosis, contributes to the timely initiation of chromosome congression, especially for polar chromosomes. While PANEM itself was previously described by this group, this manuscript provides new mechanistic evidence, improved perturbations, and detailed chromosome tracking. To my knowledge, no prior studies have mechanistically connected this contraction to polar chromosome congression in this level of detail. The work complements dominant microtubule-centric models of chromosome congression and introduces actomyosin-based forces as a cooperating system during very early mitosis. However, the impact of the study is currently limited by major organizational issues, insufficient controls, and incomplete contextualization within existing literature. Addressing these issues will substantially improve clarity and credibility. [underlined by authors]

      We have addressed the underlined criticisms as detailed above.

      Audience

      The primary audience of this study will be researchers working in cell division, mitosis, cytoskeleton dynamics, and motor proteins. The findings may also interest the wider cell biology community, particularly those studying chromosome segregation fidelity, spindle mechanics, and cytoskeletal crosstalk. If validated and clarified, the concept of PANEM could be integrated into textbooks and models of chromosome congression and could inform studies on mitotic errors and cancer cell mechanics.

      Expertise

      My expertise lies in kinetochore-microtubule interactions, spindle mechanics, chromosome congression, and mitotic signaling pathways.

      Reviewer #2 (Evidence, reproducibility and clarity):

      In this manuscript, Sheidaei et al. reported on their study of chromosome congression during the early stages of mitotic spindle assembly. Building on their previous study (ref. #15, Booth et al., eLife, 2019), they focused on the exact role of the actin-myosin-based contraction of the nuclear envelope. First, they addressed a technical issue from their previous study, finding a way to specifically impair the actomyosin contraction of the nuclear membrane without affecting the contraction of the plasma membrane. This allowed them to study the former more specifically. They then tracked individual kinetochores to reveal which were affected by nuclear membrane contraction and at what stage of displacement towards the metaphase plate. The investigation is rigorous, with all the necessary controls performed. The images are of high quality. The analyses are accurate and supported by convincing quantifications. In summary, they found that peripheral chromosomes, which are close to the nuclear membrane, are more influenced by nuclear membrane contraction than internal chromosomes. They discovered that nuclear membrane contraction primarily contributes to the initial displacement of peripheral chromosomes by moving them towards the microtubules. The microtubules then become the sole contributors to their motion towards the pole and subsequently the midplane. This step is particularly critical for the outermost chromosomes, which are located behind the spindle pole and are most likely to be missegregated.

      Significance

      While the conclusions are somewhat intuitive and could be considered incremental with regard to previous works, they are solid and improve our understanding of mitotic fidelity. The authors had already reported the overall role of nuclear membrane contraction in reducing chromosome missegregation in their previous study, as mentioned fairly and transparently in the text. However, the reason for this is now described in more detail with solid quantification. Overall, this is good-quality work which does not drastically change our understanding of chromosome congression, but contributes to improving it. Personally, I am surprised by the impact of such a small contraction (of around one micron) on the proper capture of chromosomes and wonder whether the signalling associated with the contraction has a local impact on microtubule dynamics. However, investigating this point is clearly beyond the scope of this study, which can be published as it is. [underlined by authors]

      The suggested topic (underlined) is intriguing. However, we agree with the reviewer that it is beyond the scope of this paper. The reviewer recommends publication of our manuscript as it is.

      Reviewer #3:

      Sheidaei et al. report how chromosomes are brought to positions that facilitate kinetochore-microtubule interactions during mitosis. The study focusses on an important early step of the highly orchestrated chromosome segregation process. Studying kinetochore capture during early prophase is extremely difficult due to kinetochore crowding, but the team has taken up the challenge by classifying the types of kinetochore movements, carefully marking kinetochore positions in early mitosis and linking these to map their fate/next-positions over time. The work is an excellent addition to the field, as most of the literature has thus far focussed on tracking kinetochores in slightly later stages of mitosis. The authors show that the PANEM facilitates chromosome positioning towards the interior of the newly forming spindle, which in turn facilitates chromosome congression; in the absence of PANEM, chromosomes end up in unfavourable locations and fail to form proper kinetochore-microtubule interactions. The work highlights the perinuclear actomyosin network in early mitosis (PANEM) as a key spatial and temporal element of chromosome congression which precedes the segregation process.

      Major points

      (1) The complexity of tracking has been managed by classifying kinetochore movements into 4 categories, considering motions towards or away from the spindle mid-plane. While this is a very creative solution in most cases, there may be some difficult phases that involve movement in both directions or no dominant direction (e.g. Phase 3-like). It is unclear whether all kinetochores go through Phases 1, 2, 3 and 4 sequentially or whether a few deviate from this pattern. A comment on this would be helpful. Also, it may be interesting to compare those that deviate from the sequence, and ask how they recover in the presence and absence of azBB.

      To respond to this comment, we would like to first clarify how we selected kinetochores for our analysis. We selected kinetochores that can be individually tracked. If kinetochore tracking was difficult (before the start of Phase 4 in control and azBB-treated cells or before observing the extended Phase 3 in azBB-treated cells) because of kinetochore crowding, we did not choose such kinetochores. For example, related to the next comment of this Reviewer, we did not include kinetochores close to spindle poles (within 4 µm) at NEBD in our analysis for the following two reasons: First, these kinetochores often did not show clear and rapid movements towards a spindle pole, which we used to define Phase 2. Second, although we referred to kinetochore co-localization with a microtubule signal for the start of Phase 2, this was difficult for kinetochores close to spindle poles because of a high density of microtubules. As requested, we have added this comment to the Method section (page 25).

      With the above selection, all selected kinetochores without azBB treatment (control) showed the poleward motion (Phase 2) and congression (Phase 4) in this order, though the extent of each motion varied among kinetochores. All selected kinetochores with azBB treatment also showed the poleward motion (Phase 2), and some of them showed congression (Phase 4) after Phase 2. Then, Phase 1 and Phase 3 were defined as the intervals between NEBD and Phase 2 and between Phase 2 and Phase 4, respectively. If no Phase 4 was observed with azBB, we judged that Phase 3 continued until the end of tracking. We have added this comment to the Method section (pages 25-26).
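
      The phase definitions in this response can be written down as a small rule set. The sketch below is only an illustration of those rules, not the authors' analysis code; the function name, argument names, and time values (in minutes after NEBD) are hypothetical.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]


def phase_intervals(
    nebd: float,
    phase2_start: float,
    phase2_end: float,
    tracking_end: float,
    phase4_start: Optional[float] = None,
    phase4_end: Optional[float] = None,
) -> Dict[str, Interval]:
    """Derive Phase 1 and Phase 3 from the observed Phase 2 and Phase 4."""
    phases: Dict[str, Interval] = {
        "Phase 1": (nebd, phase2_start),        # NEBD until poleward motion begins
        "Phase 2": (phase2_start, phase2_end),  # poleward motion
    }
    if phase4_start is None:
        # No congression observed: Phase 3 continues until the end of tracking.
        phases["Phase 3"] = (phase2_end, tracking_end)
    else:
        phases["Phase 3"] = (phase2_end, phase4_start)
        end = phase4_end if phase4_end is not None else tracking_end
        phases["Phase 4"] = (phase4_start, end)  # congression
    return phases


# Hypothetical kinetochore that congresses, and one that never starts Phase 4.
print(phase_intervals(0.0, 1.0, 2.0, 8.0, phase4_start=3.0, phase4_end=6.0))
print(phase_intervals(0.0, 1.0, 2.0, 8.0))
```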

      (2) Would peripheral kinetochores close to the poles behave differently compared to peripheral kinetochores close to the midplane (figure S4)? In figure 3D, are they separated? If not, would it look different?

      Since we did not include kinetochores close to spindle poles (at NEBD), for which it was difficult to define Phase 2 (see our response to the above major point 1), in our analysis, the suggested comparison is not feasible.

      (3) Uncongressed polar chromosomes (eg., CENPE inhibited cells) are known to promote tumbling of the spindle. In figure 5B with polar chromosomes, it will be helpful to indicate how the authors decouple spindle pole movements from individual kinetochore movements.

      In contrast to CENPE-inhibited cells, azBB-treated cells did not show much tumbling of the spindle, though both cells showed uncongressed polar chromosomes. The reason for this difference may be fewer uncongressed polar chromosomes in azBB-treated cells. There were still modest spindle motions in azBB-treated cells. However, because kinetochore motions were assessed relative to a spindle pole (and other reference points on the spindle) in our study (Figure 2A, C), the modest spindle motions were offset in our analyses of kinetochore motions. We have clarified the underlined part in the Method section (page 24).
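
      Why measuring kinetochore positions relative to a spindle pole offsets whole-spindle motion can be seen in a short sketch. This is a generic illustration with hypothetical coordinates (in µm), not the authors' pipeline, and it uses only the kinetochore-to-pole distance as the reference measure.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def distance_to_pole(kinetochore: Point, pole: Point) -> float:
    """Kinetochore position expressed as its distance from a spindle pole."""
    return math.dist(kinetochore, pole)


def translate(point: Point, shift: Point) -> Point:
    """Apply the same displacement that a drifting spindle would impose."""
    return tuple(p + s for p, s in zip(point, shift))


# Hypothetical positions before the spindle drifts.
kinetochore = (3.0, 4.0, 0.0)
pole = (0.0, 0.0, 0.0)

# The whole spindle (pole and attached kinetochore) drifts by the same vector.
drift = (1.5, -2.0, 0.5)
moved_kinetochore = translate(kinetochore, drift)
moved_pole = translate(pole, drift)

# The pole-relative distance is unchanged, so the drift does not register as
# kinetochore motion.
print(distance_to_pole(kinetochore, pole))
print(distance_to_pole(moved_kinetochore, moved_pole))
```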

      (4) The work has high-quality manual tracking of objects in early mitosis; if this were made available to the field, it could help build AI models for tracking. The authors could consider depositing the tracking data to increase the impact of their work.

      As suggested, we have included kinetochore tracking data as supplemental data in the revised manuscript (Figure 3 – source data 1–4; Figure 5 – source data 1, 2).

      Minor points

      (1) It will be helpful for readers to see how many kinetochores/cell were considered in the tracking studies. Figure legends show kinetochore numbers but not cell numbers.

      As suggested, we have now mentioned the number of cells, where the kinetochore motions were analyzed, in the legends for Figures 3, 4, 5, and supplemental figures.

      (2) Discussion point: If cells had not separated their centrosomes before NEBD, would PANEM still be effective? Perhaps the cancer cell lines or examples as shown in Figure 6A have some clues here.

      Following this suggestion, we first investigated the timing of spindle elongation, relative to NEBD, in asynchronous U2OS cells (Figure 8 – figure supplement 3). We imaged cells every 5 min (it was difficult to reasonably observe enough mitotic cells using a shorter interval). Most of the cells showed no significant change in the spindle length (distance between two spindle poles) after (or around) NEBD [e.g. Cell 1 in A] or a mild reduction in it [e.g. Cell 2 in A]. Only a small number of cells (2-3 out of 26) showed a mild increase in the spindle length after (or around) NEBD [e.g. Cell 3 in A]. Because the spindle elongation after NEBD was rare and mild, it was difficult to address how the timing of spindle elongation affects the effect of PANEM on reducing chromosome scattering and on chromosome relocation from polar regions. We explained this result and discussed this topic in the Discussion section.

      (3) Figure 7 cartoon shows misalignment leading to missegregation. It may be useful to consider this in the context of the centrosome directed kinetochore movements via pivoting microtubules. Is this process blocked in azBB-treated cells?

      We understand that the Reviewer refers to the kinetochore pivoting mechanism around a spindle pole, which was recently reported by the Tolic group (Koprivec et al., 2026). Such a pivoting mechanism would work only when the spindle elongates (i.e. the distance between spindle poles is enlarged) after NEBD. Therefore, to address this Reviewer’s question, we tried to assess how PANEM contraction contributes to relocating polar chromosomes when the spindle elongates before or after NEBD in asynchronous U2OS cells (i.e. in the situation where the kinetochore pivoting mechanism is applied or not), as we noted above in response to Point 2. However, spindle elongation after NEBD was rare and mild, and we were unable to address this issue (see our response to Point 2). We discussed this matter in the Discussion section.

      (4) Are all the N-CIN- lines with PANEM highly sensitive to azBB? In other words, is PANEM essential for normal congression in some of these lines.

      Because blebbistatin could kill cells by inhibiting cytokinesis, the blebbistatin sensitivity of cell growth may not necessarily reflect how essential the PANEM contraction is for chromosome congression.

      Instead, we addressed more directly how essential the PANEM contraction is for chromosome congression. We analyzed chromosome congression in RPE1 and HCT116 cells (both are N-CIN-) in the presence and absence of pnBB, the inhibitor of PANEM contraction (new Figure 8D, E). With pnBB, these cells showed congression defects, suggesting that the PANEM contraction is essential for chromosome congression in these N-CIN- cells.

      (5) Are congression times delayed in lines that naturally lack PANEM?

      For example, it takes 10-20 min for HeLa cells (lacking PANEM) to complete chromosome congression after the NEBD (Bancroft et al 2025: https://doi.org/10.1242/jcs.163659). This is not significantly different from the time (8-18 min) for chromosome congression we observed in U2OS cells (which form PANEM). We assume that cells lacking PANEM have developed a compensatory mechanism for efficient chromosome congression – we have discussed possible compensatory mechanisms in the last paragraph of the Discussion (page 17).

      (6) Page 23 "we first identified the end of congression" how does this relate to kinetochore oscillations that move kinetochores away from the metaphase plate?

      The start of kinetochore oscillation was defined as the end of Phase 4 if we could track the kinetochore until that point. In some cases where the kinetochore became close to the midplane (< 2.5 µm), it was not possible to track it further due to kinetochore crowding around the spindle mid-plane – in such cases, the end of Phase 4 was assigned as the end of tracking. These definitions were not necessarily clear in the original manuscript. Moreover, in the original manuscript, it was not clearly stated that the end of Phase 4 was defined in the same way for both non-polar and polar kinetochores. We have now clarified these points in the Method section (page 25).

      (7) Are spindle pole distances (spindle sizes) different in early and late mitotic cells (4min vs 6min after NEBD) in control vs azBB-treated cells? Please comment on Figure S2E (mean distance) in the context of when phase 4 is completed. Does spindle size return to normal after congression?

      In Figure S2E (Figure 1 – figure supplement 6 in the revised manuscript), we did not observe a significant difference in the spindle-pole distance (the spindle size) between control and azBB-treated cells at any individual time point. The smallest p-value was 0.094 at 6.0 min. As suggested, we have explained this in the legend for this supplementary figure. Completion of Phase 4 is highly variable across different kinetochores within the same cell; thus, a general comment on its completion timing in cells is not feasible.

      Significance:

      The current work builds upon their previous work, in which the authors demonstrated that an actomyosin network forms on the cytoplasmic side of the nuclear envelope during prophase. This work explains how the network facilitates chromosome capture and congression by tracking motions of individual kinetochores during early mitosis. The findings can be broadly useful for cell division and the cytoskeletal fields.

    1. eLife Assessment

      This potentially useful paper presents an intriguing hypothesis about the evolutionary origins of the SLC25 family of mitochondrial carrier proteins common to all eukaryotic life, proposing that all members originated from the ADP/ATP carrier (AAC) and that AAC itself may have emerged from bacterial homologs such as CysZ and YihY. While the phylogenetic analyses and structural searches are reasonable methodologies to explore ancient evolutionary events, the evidence provided here is deemed to provide incomplete support for the conclusion that the mitochondrial ATP transporter is related to CysZ and YihY.

    2. Reviewer #1 (Public review):

      Summary:

      This paper tries to address an important outstanding issue, which is the evolutionary origin of the SLC25 family of mitochondrial carrier proteins, which are common to all eukaryotic life, with few exceptions. The authors have carried out phylogenetic analyses and DALI searches of AlphaFold databases of bacterial and archaeal membrane proteins. They identify two bacterial proteins, CysZ and YihY, and they propose that they are progenitors of SLC25 family members. Whilst the paper addresses an interesting topic, the conclusions are not supported by the data and are not presented in an unbiased manner, as they highlight only features that provide some tentative support for the hypothesis. They do not address the large number of sequence and structural properties that refute the hypothesis, such as the asymmetric vs three-fold pseudo-symmetric features, hexamer vs monomer, and the complete lack of any conserved motifs with similar functions. Any resemblances between CysZ/YihY and mitochondrial carriers thus seem to be superficial and could well be coincidental, as they represent generic properties of membrane proteins rather than specific ones indicative of an evolutionary relationship.

      Strengths:

      This paper explores the evolutionary origins of the SLC25 family of mitochondrial carrier proteins, which are found across nearly all eukaryotic organisms. They were likely to be present in the last common ancestor of all eukaryotes, around two billion years ago. The question is whether they are of bacterial, archaeal or eukaryotic origin. The authors propose that two bacterial proteins, CysZ and YihY, may represent ancestral forms of these carriers, based on structural comparisons of models, a sequence motif, and phylogenetic analyses. While the research addresses an important and longstanding question, the presented evidence does not convincingly support their hypothesis.

      Weaknesses:

      A central concern is the reliance on structural similarity searches using predicted protein models, since these models are often built using known protein structures as templates, and thus these searches may produce misleading matches. The reported similarities between CysZ, YihY, and mitochondrial carriers are weak and fall within ranges expected for unrelated membrane proteins, which commonly share general structural features, such as helical bundles. Quantitative measures of similarity are low and do not support a shared evolutionary origin. The case for YihY is extremely poor, as neither structure nor sequence features support the claim. Importantly, the opening of YihY is towards the membrane rather than the water phase, as is the case for carriers, indicating that it has a very different structure and function. The case for CysZ is somewhat better, as it is a helical bundle with two short helices somewhat resembling the matrix helices of mitochondrial carriers, and a short sequence PXDXXK that is part of one of the known sequence motifs of mitochondrial carriers, but this is where the similarities end.

      Mitochondrial carriers have a distinctive threefold pseudo-symmetrical structure and a highly complex transport mechanism involving six structural elements. This paper's hypothesis does not explain how such a high level of threefold pseudo-symmetry could have evolved from entirely asymmetric proteins. To complicate matters further, CysZ is not functional as a monomer but forms a functional hexamer, which also explains why it has two half helices rather than two transmembrane helices. Thus, the hypothesis is that CysZ, which is an asymmetric protomer of a functional hexamer, has evolved into a three-fold pseudo-symmetric protein, which is functional as a monomer. A more convincing explanation is that the threefold pseudo-symmetrical structure arose from gene triplication and fusions, with later mutations introducing asymmetry to support diverse substrate binding. In support of this notion, mitochondrial carriers transporting large molecules, such as ATP, show more asymmetry, whereas those for small molecules remain nearly symmetrical. In general, the vast majority of transport proteins arose from gene duplications and fusions of the domains.

      Although mitochondrial carriers have a sequence motif similar to that found in CysZ (PXDXXK), their roles are very different. In mitochondrial carriers, this motif is located roughly in the middle of transmembrane helices H1, H3, and H5, where proline creates a pronounced kink, bringing the charged residues inward to form a salt-bridge network in the central water-filled cavity. The formation and disruption of this network are essential for the transport mechanism when switching between inward- and outward-open states. In CysZ, the motif is found at the end of a helix and in the following loop at the end of the transporter, with residues pointing outward toward the water phase. These residues are typical of membrane-water interface regions, where proline acts as a helix breaker and charged residues interact with the water phase. Thus, this motif in CysZ does not match the position or function seen in mitochondrial carriers, and its presence is likely to be coincidental, because these residues often occur in the water-membrane region. Importantly, none of the other important conserved three-fold symmetrical motifs of mitochondrial carriers is found in these bacterial proteins, such as the cytoplasmic network [YF][DE]xx[RK], cardiolipin binding sites, ER-links, and sequences of small amino acids, which are critical for the carriers' dynamic mechanism.
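
      The two motifs named in this paragraph can be written as regular expressions, which makes the reviewer's point concrete: a pattern match reports only where a short string occurs, not whether it sits mid-helix or at a membrane-water interface. This is a minimal sketch; the test sequence is invented for illustration and is not a real CysZ, YihY, or carrier sequence.

```python
import re
from typing import Dict, List

# "X"/"x" in the motif notation means any residue, written "." in a regex.
MOTIFS = {
    "PXDXXK": re.compile(r"P.D..K"),
    "[YF][DE]xx[RK]": re.compile(r"[YF][DE]..[RK]"),
}


def find_motifs(sequence: str) -> Dict[str, List[int]]:
    """Return 0-based start positions of non-overlapping matches per motif."""
    return {
        name: [match.start() for match in pattern.finditer(sequence)]
        for name, pattern in MOTIFS.items()
    }


# Invented 15-residue sequence containing one instance of each motif.
hypothetical_sequence = "AAPLDAAKGGYDAAR"
print(find_motifs(hypothetical_sequence))
```

      Deciding whether a hit is functionally equivalent would additionally require its position within the transmembrane helices and the orientation of its side chains, which a sequence scan does not provide.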

      The phylogenetic relationship is also overstated, as there is no sequence similarity between these proteins other than that occurring because of similar biophysical properties, such as transmembrane helices. The authors suggest that a specific mitochondrial carrier represents the ancestral member of the family, but this conclusion appears to be inferred rather than rigorously demonstrated. Key aspects, such as tree rooting and taxon sampling, are not sufficiently addressed, weakening confidence in the evolutionary claims. Further, the selection of only a few bacterial and archaeal proteomes for analysis limits the study's scope. Broader searches would be necessary to support claims about conservation and ancestry. Independent sequence searches indicate that CysZ and YihY are not widely conserved in the bacterial groups most closely related to mitochondria, undermining the argument that they are plausible ancestors.

      Overall, the presented similarities are superficial and can be explained by general features of membrane proteins rather than by specific adaptations to function. The hypothesis that CysZ and YihY are evolutionary precursors of mitochondrial carriers is not supported by the presented data.

    3. Reviewer #2 (Public review):

      Summary:

      Here, the authors performed a phylogenetic analysis of mitochondrial ATP/ADP carrier (AAC) proteins. They also performed a structure-based screen for remote homologs, seeking to reveal their evolutionary origins. The authors claim that AACs are found at the root of their family tree, and through a structure-based homolog search protocol, identify putative prokaryotic homologs.

      The proposed evolutionary history of AACs is bold and complicated, but the phylogenetic methodology and the way in which the tree is interpreted are incomplete and unconvincing. Further, the structure-based search strategy uses very relaxed cutoffs for fold similarity, which may be fine, but it does not clearly justify this decision. This is potentially very problematic, as I did not find the quantitative or qualitative assessments of fold similarity particularly compelling.

      In summary, the authors have presented a bold and extremely interesting hypothesis for the evolution of these proteins, but there is insufficient support for their claims.

      Strengths:

      (1) The authors are presenting a very interesting hypothesis about the birth of these proteins, including that they may have undergone a radical rearrangement in their sequence at some point in evolution.

      (2) The paper makes use of appropriate tools for structure-based homolog identification.

      (3) Identification of a conserved sequence motif in these twilight zone proteins would be a rare and interesting occurrence, and could be consistent with their proposed homology.

      Weaknesses:

      (1) The phylogenetic analysis and its interpretations are incomplete. The authors regularly refer to the root of the tree, and its placement is given central importance. However, the methodology by which they selected the root is unexplained. This is notable, as the proposed root is curious and quite confusing. It implies that (at least) yeast and Paramecium AACs are independently paraphyletic. While certainly not impossible, this evokes quite a complicated evolutionary history. The taxonomy of this gene family, when rooted this way, does not seem to echo the phylogeny of species, suggesting an extremely complex history of duplication/loss and horizontal gene transfer, none of which the authors discuss in detail. Perhaps more clearly and specifically: I'm very surprised by the branching order at the root, where there are three independent branches of fungal proteins, followed by the excavate proteins in a monophyletic clade, followed by several independent branches of the Paramecium proteins. I very much expect incomplete lineage sorting at this evolutionary depth, but this seems extreme to the point that I question if it is accurately placed. More directly: this very much looks like an unrooted tree, presented radially.

      (2) The Bayesian and ML trees seem quite incongruent, but this is not discussed. In fact, the text states that they "exhibit a similar tree topology." This is admittedly very difficult to assess without very carefully going over the tree, branch by branch, but there are nevertheless differences, the most obvious being paraphyly vs monophyly of taxon-specific AAC clades. Do the authors have any comments on this, and can they show some sort of consensus tree? How does this affect their interpretation?

      (3) Presenting branch support as similarly-sized points makes it nearly impossible to actually judge the strength of support.

      (4) The use of structure for remote homology detection is becoming increasingly popular, and in my opinion, is very powerful. But it is still much too early to be taken for granted. The methodology must be justified. Most importantly, the authors have not clearly described why they chose these quantitative cutoffs (I'm mostly thinking of the Dali Z-score cutoff, which here seems very low for a transmembrane protein of this size, as the Z-score is very dependent on alignment length). The authors reference categories defined by tool authors, but why a Z-score of 3, specifically? The same goes for TM scores. There are not yet any defined best practices, to my knowledge, so the authors should independently validate/justify their approach in some way and/or cite and discuss relevant literature (there have been a growing number of these screens using similar approaches in recent years).

      (5) The proposed homologs have very little quantitative structural similarity to the query structure, or to each other, as shown in Figure 3 (and hence my concerns about the methodology). Also, I did not find the structural alignments in the supplement or Figure 4 to be qualitatively compelling. They simply appear too different, and I cannot discard this qualitative assessment because the quantitative similarities are likewise very weak. It's not clear to me if this is because the folds are in fact different, or if my view of them is a presentation issue (perhaps it could be improved by visualizing more angles, or more carefully cartooning the similarities and differences).

      (6) The authors point out that the alpha-helices are ordered differently in YihY and CysZ, and that their membrane orientation is flipped. Taken at face value, I would view this as evidence against homology. This could perhaps be more reasonably explained as convergent global fold similarity resulting from different underlying structures. However, the authors imply that this may be the result of the transposition of the sequences encoding these alpha helices, yet there is no convincing description or argument concerning when and how this could have occurred. I think this would be a deeply interesting phenomenon, but there is insufficient evidence and discussion to seriously consider whether or not it is homology or convergence.

      (7) Following up on comment #5, the authors did perform a very interesting in silico experiment by transposing sequences to reorder the helices. They then note that structural similarity improved. This is very, very interesting, but without other evidence of homology between the transposed alpha helices, I do not think this disproves alternative hypotheses. Does any such evidence exist?

      (8) The authors show in Figure 5E-F that sequence transposition flips the membrane orientation, such that YihY and CysZ have extracellular termini (which you would expect from homologs, I suppose). But it is just cartooned and not discussed. Is this computationally or experimentally supported?

      (9) The putative presence of a conserved motif would be a very compelling piece of evidence consistent with homology. However, it is not clear to me in the text which proteins actually have the repeats - is it truly just CysZ? What does this mean for YihY? Further, what specifically is being proposed to be homologous? Is SLC25 repeat 2 proposed to be homologous to CysZ repeat 2 (and the same for 3 to 3)? If so, this would seem to have implications for the transposition hypothesis. The helix nomenclature (e.g., H1-6) suggests homology across the proteins (i.e., H1 is homologous to H1); however, wouldn't the presence of these conserved domains instead, for example, suggest homology between SLC H3 and CysZ H2? The authors' conclusions are not clear, and it is difficult to interpret what the implications are for assessing homology.

      (10) The sequence retrieval methods are incomplete, so it is impossible to reproduce the searches or to judge their accuracy and scope. What were the E-value cutoffs and other settings used in the searches?

      (11) The phylogenetic methods are incomplete. What substitution models were used, and how were they chosen? What branch support method was used? What were the stop conditions of the Bayesian analysis (e.g. did the authors monitor for convergence, and how)? How much of the Bayesian analysis was considered burn-in, if any? And echoing points 1 & 2 above, how were these phylogenies rooted?

      (12) Throughout, there is a distinct lack of careful, evolutionarily informative language.

      (i) In reference to the phylogeny, the authors frequently refer to "grouping," but it's not entirely clear what this means. Referring to clades and their branching order would be more informative.

      (ii) The authors refer to the excavate branch as the "most ancient." Whether or not excavates most closely resemble LECA is somewhat irrelevant, because the branch itself is not the most ancient - it is equally as ancient as its sister branch, which may be all other eukaryotes.

      (iii) Likewise, the authors refer to bacterial proteins as "the evolutionary ancestor of mitochondrial AACs," and state that "AAC emerged from the conserved sulfat transporter CysZ." But extant bacteria are not the ancestors of mitochondria - nor are extant proteins descended from other extant proteins. They are, perhaps more accurately, cousins.

      (iv) The authors refer to AACs as "evolutionarily founder member of the SLC25 carrier family," but I'm not sure that has a clear evolutionary meaning, unless the authors mean to say that the common ancestor was more AAC-like than anything-else-like. Even if the rooting is accurate, a basal branch does not necessarily reflect the ancestral state.

    4. Reviewer #3 (Public review):

      Summary:

      The most important weakness is that the authors have avoided the direct structural comparison of experimentally determined X-ray structures of AAC and CysZ. Instead, the comparisons are made through predicted membrane topologies and predicted structural models of protein homologs, which give rise to misleading results. Direct comparison of the X-ray structures of the ADP/ATP carrier and CysZ clearly shows that these proteins have very different folds. Therefore, flaws in the methods produce results that lead to the wrong conclusions, and the authors have not achieved their aims.

      Weaknesses:

      (1) Figure 2. There is something very strange about how the tree is drawn, given that S. cerevisiae AAC1, AAC2 and AAC3 share about 76-83% sequence identity but appear to be very diversified in the tree. The phylogenetic trees are only based on the sequences of three species. The authors should explain in much more detail how they made the phylogenetic trees to support their statement that all mitochondrial carriers have come from an ancient AAC.

      (2) There are at least three and seven X-ray structures of CysZ (with about 43% sequence identity to the E. coli homolog) and AAC, respectively, deposited in the Protein Data Bank. Therefore, there is no need for the approach using predicted structures as described in the manuscript. It is clear from direct comparison of the CysZ and AAC structures that they have very different folds, i.e. lengths of the transmembrane helices, their orientation and packing. CysZ has been suggested to form dimers or trimers of dimers (eLife 2018;7:e27829), with each protomer formed by two long transmembrane helices and four short helices that do not cross the membrane totally. Thus, CysZ has a different membrane topology and oligomeric state than AAC (monomer with six transmembrane helices). CysZ is therefore rightfully classified in a separate 3D domain fold from mitochondrial carriers in various protein family and domain databases.

      (3) In the 3D structures of CysZ, the conserved QYXDYPXDNHK motif is involved in a network of hydrogen bonds and salt bridges thought to hold the helical bundle together (eLife 2018;7:e27829). This motif is similar to PX[DE]XX[KR], a part of the signature motif, typical of mitochondrial carriers, which is repeated three times in the sequences and forms a three-fold pseudo-symmetrical salt bridge network of the so-called matrix gate that opens and closes during the transport cycle. Therefore, although this single motif in CysZ is similar to those of mitochondrial carriers, it is not found in a similar structural context to those in AAC structures.

      (4) It appears odd that the sulfate transporter CysZ should be more similar to nucleotide-transporting AAC than any of the other mitochondrial carriers, of which some transport sulfate.

      (5) The AlphaFold model of YihY is not very similar to the crystal structures of either CysZ or AAC.

      (6) The authors are relying too much on the TM-score results. The values of 0.5-0.6 between AAC and CysZ or YihY probably reflect that they contain six main helices. However, as noted in point 2, they have very different folds.
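
      For readers less familiar with the metric discussed in point 6, the following is a minimal sketch of how a TM-score is computed for one fixed superposition, using the standard definition (a length-dependent scale d0 and a sum over aligned residue pairs normalized by the target length). It is an illustration only: the function names and the target length of 300 are hypothetical, the real TM-score takes the maximum over superpositions, and nothing here reproduces the authors' analysis or any specific tool.

```python
def tm_d0(l_target: int) -> float:
    """Distance scale d0 (in angstroms) from the standard TM-score definition."""
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        # The formula is only meaningful for sufficiently long targets.
        raise ValueError("l_target is too short for the d0 formula")
    return d0


def tm_score(distances, l_target: int) -> float:
    """TM-score for ONE given superposition (no maximization over superpositions).

    distances: distances (angstroms) between aligned residue pairs.
    l_target:  length of the target protein used for normalization.
    Residues that are not aligned contribute nothing to the sum.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


# Hypothetical target length, chosen only for illustration.
L_TARGET = 300

# Perfect superposition of every residue gives the maximum score.
perfect = tm_score([0.0] * L_TARGET, L_TARGET)

# Two quite different situations that both give a score of one half:
# only half of the residues aligned, but exactly ...
half_aligned = tm_score([0.0] * (L_TARGET // 2), L_TARGET)
# ... versus every residue aligned, each pair separated by exactly d0.
all_at_d0 = tm_score([tm_d0(L_TARGET)] * L_TARGET, L_TARGET)
```

      Because the score is a length-normalized sum, the same value can arise from a partial but tight alignment or from a complete but loose one, which is why the score alone does not say which parts of two structures agree.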

    5. Author response:

      Thank you for your decision letter with the public review and the recommendations. While we are delighted that the referees feel the work is addressing an outstanding and important issue, they have raised concerns regarding the strength of the support. We will address all the concerns in full in a revised manuscript in due course. Please find below a couple of general points regarding the referees’ concerns and a proposal as to how we plan to address them.

      (1) The idea of the manuscript is to present a plausible solution for a long-standing question in the field of mitochondrial biology and evolution. The fact that the identified solution to the origin of AAC transporters is a remote structural homolog (as our later detailed response will show, it is better than any other sequence or structure available to date) is to be expected. If the actual similarities were any better than what we have identified (with a special case of circular permutation), they could have been identified by other, simpler structural homology search methodologies.

      (2) A recurrent and strong disagreement of the reviewers with the findings presented in this manuscript is rooted in the fact that the structural and sequence relatedness between AAC and CysZ detected in this work is so weak that it could be coincidental rather than an actual evolutionary link. In light of this, we have now carefully searched all available structural databases, such as SCOP, CATH, and ECOD, to see whether this fold link has been noted independently by others. We notice that in the ECOD (Evolutionary Classification of Protein Domains) database only AAC and CysZ are grouped together under a single Possible homology group (X) called ‘Mitochondrial ADP/ATP carrier-like’. The ECOD database contains a hierarchical classification of protein domains organized according to their evolutionary relationships, and the server is maintained by Prof. Nick Grishin at The University of Texas Southwestern Medical Center.

      Link to ECOD database: http://prodata.swmed.edu/ecod/index_af2_pdb.php

      Reference: Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, et al. (2014) ECOD: An Evolutionary Classification of Protein Domains. PLOS Computational Biology 10(12): e1003926. https://doi.org/10.1371/journal.pcbi.1003926

      Therefore, our study and the independent findings of the ECOD database team together offer greater confidence in the proposed remote evolutionary relationship between AAC and CysZ, and indicate that the structural and sequence similarities we report in the manuscript are not a mere coincidence. We will also incorporate the details of the possible evolutionary relationship between AAC and CysZ identified in the ECOD database in the revised version of the manuscript.

      (3) One point we would like to stress is that, considering all the similarities identified, the relationship is very unlikely to fall into the class of ‘convergent evolution’. We will make this point explicit in the revised version.

      (4) Lastly, while we totally agree that the similarities are in the twilight zone, considering the importance of the problem, we feel that our work would induce researchers from the field of protein design to attempt possible interconversion of the two distantly related transporters thus providing an experimental rationale for the evolution of these transporters.

    1. eLife Assessment

      This valuable study uses a computer vision pipeline to infer the motor control of cephalopod skin, revealing that individual chromatophores exhibit anisotropic deformations and can be associated with multiple putative motor units. The evidence supporting these claims is convincing, and the authors present some limited electrophysiological validation of the findings from their computational analysis. This work will be of significant interest to biologists studying cephalopod behavior and motor control.

    2. Reviewer #1 (Public review):

      Renard, Ukrow et al. applied their recently published computational pipeline (CHROMAS) to the skin of Euprymna berryi and Sepia officinalis to track the dynamics of cephalopod chromatophore expansion. By segmenting each chromatophore into radial slices, and analyzing the co-expansion of slices across regions of the skin, they inferred the motor control underlying chromatophore groups.

      Strengths:

      - The authors demonstrate that most motor units of cephalopod skin include a subregion of multiple chromatophores, creating "virtual chromatophores" between fixed chromatophores. This is an interesting concept that challenges prevailing models of chromatophore organization, and raises interesting possibilities for how chromatophore arrays may be patterned during development.

      - This study introduces new analytical approaches of cephalopod skin that will be valuable for the quantitative study of cephalopod behavior.

      Weaknesses:

      - The authors use patch-clamp experiments in E. berryi to test their approach for inferring motor units. The stimulations indeed evoke expansions of sub-regions of each chromatophore, creating "virtual chromatophores". However, they were not able to predict these motor units from behavioral analysis before confirming them with patch-clamp, limiting the strength of this validation.

      - In S. officinalis, chromatophores are far more numerous than in E. berryi and exhibit frequent spontaneous activity, making it more challenging to distinguish shared motor drive. Patch-clamp experiments in this species would provide important validation and strengthen confidence in the method for inferring motor units.

      - Although multiple experimental conditions were tested (e.g., age, size, behavioral context, sedation, head-fixation, lighting), data is only shown from a small subset of experiments. Analyzing pooled data across conditions would allow for more generalizable conclusions.

      - Different clustering algorithms were used for the two species (HDBSCAN for E. berryi and Affinity Propagation for S. officinalis). Since Affinity Propagation appeared to better capture correlation structure in S. officinalis, it would be informative to reanalyze the E. berryi data using the same method to assess potential algorithm-dependent biases.

      Conclusion:

      The CHROMAS tool is likely to be valuable to the field, given the need for quantitative frameworks in cephalopod biology. The predictions outlined here provide a useful foundation for future experimental investigation.

    3. Reviewer #2 (Public review):

      Summary:

      Overall, this is an excellent paper, making use of a newly developed system for monitoring the behaviour of chromatophores in the skin of (mostly) free swimming bobtail squid and European cuttlefish. The manuscript is very well written, clearly presented and very well structured. The central finding, that individual chromatophores are connected to multiple motor neurones, is not new. Novelty instead comes from the ability to measure the actuation of chromatophore sections across wide areas of skin in free-swimming animals, showing the diversity of local motor units and reinforcing the notion that individual chromatophores are not necessarily the individual units of colour change, but rather local motor units that cover multiple neighbour and near neighbour chromatophore muscles. This is an excellent finding and one that will shape our understanding of the neural control of cephalopod skin colour. I have a number of minor points below that the authors will need to address before acceptance.

      Strengths:

      The methodological approach to collecting large amounts of data about local variations in the expansion of sections of chromatophores is exciting, and the analysis pipeline for clustering sections of chromatophores whose spontaneous activity correlated over time is powerful and exciting.

      Comments on revisions:

      All concerns have been addressed in the revised version of the manuscript.

    4. Reviewer #3 (Public review):

      Summary:

      This study uses high-resolution videography and a custom computer-vision pipeline to dissect the motor control of cephalopod chromatophores in Euprymna berryi and Sepia officinalis. By quantifying anisotropic chromatophore deformations and applying dimensionality reduction methods, the authors infer that individual chromatophores can be a part of multiple motor units. Clustering analyses reveal putative motor units that often span multiple chromatophores, with diverse and overlapping geometries. Chromatophore expansion dynamics are faster and more stereotyped than relaxation, consistent with active neural contraction followed by passive recoil. Together, the results show that chromatophores function not as uniform pixels but as fractionated, coordinately controlled elements that enable flexible pattern generation.

      Strengths:

      The authors present compelling, direct evidence that (a) chromatophore deformations are anisotropic, and indirect evidence that (b) individual chromatophores can be split across multiple putative motor units. This evidence is provided through data collected over large spatial scales, but also at a sub-chromatophore resolution. This combination of scale and resolution is not possible using traditional neuroanatomical and physiological approaches alone.

      The authors also develop a new non-invasive, image analysis approach to extract information about chromatophore deformation across large spatial scales on the organism's body. In principle this approach is applicable across species and may allow for further comparative characterization of chromatophore motor control. It is therefore a promising new tool and useful resource for the community.

      Weaknesses:

      An important weakness of the work is that the methods the authors develop can only be applied during resting, spontaneous 'flickering' activity of chromatophores to yield interpretable results at the motor unit level. This is because common presynaptic input would confound the identification of individual motor units. Thus, there remains a large difficulty in linking insights about single motor unit organization to the circuit and behavioral levels.

      Another weakness of this paper is the rather limited electrophysiological validation of the computational findings. The authors present only one electrophysiology experiment in E. berryi, the species that they used only for 'methodological development' and not for detailed characterization. A complementary electrophysiological experiment in S. officinalis, or some visualization of neuron morphology confirming that motor neurons do indeed project to multiple chromatophores, would strengthen the generalizability of their computational analysis. This would be particularly pertinent to validate the authors' claim that some motor units contain chromatophores that are quite distant from one another on the animal.

      Overall, the authors' technical contributions and method development are an important advance. This work serves as an excellent proof of concept that their method can extract useful information about chromatophore motor control. Further validation of their method is needed to fully trust the fine-scale conclusions drawn about the distribution and composition of multi-innervated chromatophores. Furthermore, the authors raise many interesting ideas about developmental constraints on circuit wiring and the potential adaptive significance of multi-innervated chromatophores for certain features of camouflage patterning. Their method may be able to help resolve some of these questions in the future if it is refined and applied across developmental stages, regions on the animal, and species.

      Comments on revisions:

      Thank you for clarifying my major point of confusion regarding how one might connect these results to behaviorally relevant camouflage. I now have a better understanding of the authors' rationale in studying resting activity of motor units and believe that the clarifications added to the manuscript will help other readers who encounter similar confusion.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Renard, Ukrow et al. applied their recently published computational pipeline (CHROMAS) to the skin of Euprymna berryi and Sepia officinalis to track the dynamics of cephalopod chromatophore expansion. By segmenting each chromatophore into radial slices and analyzing the co-expansion of slices across regions of the skin, they inferred the motor control underlying chromatophore groups.

      Strengths:

      The authors demonstrate that most motor units of cephalopod skin include a subregion of multiple chromatophores, creating "virtual chromatophores" in between the fixed chromatophores. This is an interesting concept that challenges prevailing models of chromatophore organization, and raises interesting possibilities for how chromatophore arrays may be patterned during development.

      This study introduces new analyses of cephalopod skin that will be valuable for the quantitative study of cephalopod behavior.

      Weaknesses:

      The authors chose to image spontaneous skin changes in sedated animals, rather than visually-evoked skin changes in awake, freely-moving animals. Spontaneous chromatophore changes tend to be small shimmers of expansion and contraction, rather than obvious, sizable expansions. This may make it more challenging to distinguish truly co-occurring expansions from background activity. The authors don't provide any raw data (videos) of the skin, so it is difficult to independently assess the robustness of the inferred chromatophore groupings.

      The patch-clamp experiments in E. berryi are used to test the validity of their approach for inferring motor units. The stimulations evoke expansions of sub-regions of each chromatophore, creating "virtual chromatophores" as predicted from the behavioral analysis. However, the authors were not able to predict these specific motor units from behavioral analysis before confirming them with patch-clamp, limiting the strength of the validation. It would be informative to quantify the results of the patch-clamp experiments - are the inferred motor units of similar sizes to those predicted from behavior?

      The authors report testing multiple experimental conditions (e.g., age, size, behavioral stimuli, sedation, head-fixation, and lighting), but only a small subset of these data are presented. It is difficult to determine which conditions were used for which experiments, and the manuscript would benefit from pooling data from multiple experiments to draw general conclusions about the motor control of cephalopod skin.

      The authors use a different clustering algorithm for E. berryi and S. officinalis, but do not discuss why different clustering approaches were required for the two species.

      Impact:

      The authors use their computational pipeline to generate a number of interesting predictions about chromatophore control, including motor unit size, their spatial distribution within the skin, and the independent control of subregions within individual chromatophores by putatively distinct motor neurons. While these observations are interesting, the current data do not yet fully support them.

      The CHROMAS tool is likely to be valuable to the field, given the need for quantitative frameworks in cephalopod biology. The predictions outlined here provide a useful foundation for future experimental investigation.

      We thank the reviewer for the thoughtful and detailed evaluation of our work and for recognizing the potential of the CHROMAS pipeline for studying chromatophore control.

      We agree that some aspects of the manuscript required clarification and additional explanation, and we have revised the text accordingly. We also now provide access to representative raw video recordings in the Data Availability section. In the E. berryi patch-clamp experiments, single motor neurons evoked expansions of sub-regions of chromatophores, consistent with the “virtual chromatophore” concept. We have now quantified the size of motor units across patch-clamp sessions, and the results show that the inferred motor-unit sizes broadly match those predicted from behavioral recordings, supporting the validity of our approach.

      We agree that pooling data across individuals would provide valuable insight into variability across animals. In practice, we recorded chromatophore activity from several animals (14 Euprymna berryi and 12 Sepia officinalis) under different experimental conditions during development of the experimental pipeline. However, acquiring long, stable, artifact-free recordings suitable for motor unit analysis is technically challenging. We now clarify this point in the manuscript. Specifically, we explain that multiple animals were recorded during pipeline development, while the analyses presented focus on recordings with the highest signal quality. We anticipate that the framework introduced here will enable future studies to collect larger datasets and compare motor unit organization across individuals, developmental stages, and species.

      HDBSCAN was used for E. berryi during initial exploratory analyses, and Affinity Propagation was adopted for S. officinalis because it better captured the correlation structure of those recordings. We did not re-analyze the E. berryi data with Affinity Propagation, and the implications of algorithm choice are now discussed in the Discussion.
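
      To make the general idea of grouping chromatophore slices by correlated activity concrete, the following is a toy sketch only. It does not reproduce CHROMAS, HDBSCAN, or Affinity Propagation: as a simpler stand-in, it links slices whose activity traces have a Pearson correlation above a threshold and reports the connected components of that graph. The slice names, traces, and the 0.8 threshold are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length traces (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def group_by_correlation(traces, threshold=0.8):
    """Connected components of the graph linking slices with correlation >= threshold."""
    names = list(traces)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if pearson(traces[a], traces[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical expansion traces for four slices from three chromatophores.
traces = {
    "chrA_slice1": [0, 1, 0, 2, 0, 1],
    "chrB_slice4": [0, 2, 0, 4, 0, 2],  # scaled copy of chrA_slice1
    "chrA_slice3": [1, 0, 2, 0, 1, 0],
    "chrC_slice2": [2, 0, 4, 0, 2, 0],  # scaled copy of chrA_slice3
}

groups = group_by_correlation(traces)
```

      In this toy input, the two slices of the hypothetical chromatophore A fall into different groups, each shared with a slice of a neighbouring chromatophore. Different grouping rules (density-based, exemplar-based, or a simple threshold as here) can partition the same correlation structure differently, which is the algorithm-dependence at issue.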

      Reviewer #2 (Public review):

      Summary:

      Overall, this is an excellent paper, making use of a newly developed system for monitoring the behaviour of chromatophores in the skin of (mostly) free-swimming bobtail squid and European cuttlefish. The manuscript is very well-written, clearly presented and very well-structured. The central finding, that individual chromatophores are connected to multiple motor neurones, is not new. Novelty instead comes from the ability to measure the actuation of chromatophore sections across wide areas of skin in free-swimming animals, showing the diversity of local motor units and reinforcing the notion that individual chromatophores are not necessarily the individual units of colour change, but rather local motor units that cover multiple neighbour and near-neighbour chromatophore muscles. This is an excellent finding and one that will shape our understanding of the neural control of cephalopod skin colour.

      Strengths:

      The methodological approach to collecting large amounts of data about local variations in the expansion of sections of chromatophores is exciting, and the analysis pipeline for clustering sections of chromatophores whose spontaneous activity correlated over time is powerful and exciting.

      Weaknesses:

      Some minor edits and typographical errors need correcting. I also had some concerns that the preparation for the electrophysiological section of the manuscript complies with the journal's ethical requirements, so I would urge that this be carefully checked.

      We thank the reviewer for the positive evaluation of our work and for recognizing the value of the methodological approach and the clarity of the manuscript.

      We have carefully reviewed the manuscript and corrected minor typographical errors.

      Regarding the ethical considerations raised for the electrophysiological experiments, we have carefully verified that the experimental procedures comply with the journal's ethical requirements and relevant institutional guidelines.

      Reviewer #3 (Public review):

      Summary:

      This study uses high-resolution videography and a custom computer-vision pipeline to dissect the motor control of cephalopod chromatophores in Euprymna berryi and Sepia officinalis. By quantifying anisotropic chromatophore deformations and applying dimensionality reduction methods, the authors infer that individual chromatophores can be a part of multiple motor units. Clustering analyses reveal putative motor units that often span multiple chromatophores, with diverse and overlapping geometries. Chromatophore expansion dynamics are faster and more stereotyped than relaxation, consistent with active neural contraction followed by passive recoil. Together, the results show that chromatophores function not as uniform pixels but as fractionated, coordinately controlled elements that enable flexible pattern generation.

      Strengths:

      The authors present compelling, direct evidence that (a) chromatophore deformations are anisotropic, and indirect evidence that (b) individual chromatophores can be split across multiple putative motor units. This evidence is provided through data collected over large spatial scales, but also at a sub-chromatophore resolution. This combination of scale and resolution is not possible using traditional neuroanatomical and physiological approaches alone.

      The authors also develop a new non-invasive, image analysis approach to extract information about chromatophore deformation across large spatial scales on the organism's body. In principle, this approach is applicable across species and may allow for further comparative characterization of chromatophore motor control. It is therefore a promising new tool and useful resource for the community.

      Weaknesses:

      An important weakness of the work is that the methods the authors develop can only be applied during resting, spontaneous 'flickering' activity of chromatophores. The inability to reliably apply their technique during any kind of realistic camouflage is a large limitation, as it means this method cannot be used to study the dynamics of motor control during realistic camouflage behaviors.

      Another weakness of this paper is the rather limited electrophysiological validation of the computational findings. The authors present only one electrophysiology experiment in E. berryi, the species that they used only for 'methodological development' and not for detailed characterization. A complementary electrophysiological experiment in S. officinalis, or some visualization of neuron morphology confirming that motor neurons do indeed project to multiple chromatophores, would strengthen the generalizability of their computational analysis. This would be particularly pertinent to validate the authors' claim that some motor units contain chromatophores that are quite distant from one another on the animal.

      Overall, the authors' technical contributions and method development are an important advance. This work serves as an excellent proof of concept that their method can extract useful information about chromatophore motor control. Further validation of their method is needed to fully trust the fine-scale conclusions drawn about the distribution and composition of multi-innervated chromatophores. Furthermore, the authors raise many interesting ideas about developmental constraints on circuit wiring and the potential adaptive significance of multi-innervated chromatophores for certain features of camouflage patterning. Their method may be able to help resolve some of these questions in the future if it is refined and applied across developmental stages, regions of the animal, and species.

      We thank the reviewer for their thoughtful evaluation and for recognizing the potential of the computational approach introduced in this study.

      Regarding the focus on spontaneous chromatophore activity, we have clarified earlier in the Results section why these events are necessary to isolate individual muscle activations. While large camouflage patterns are visually striking, they involve the coordinated activation of many groups of chromatophores by premotor circuits simultaneously, making the identification of individual motor units, our goal here, impossible. Our approach can, however, also be applied during active behavior, including camouflage; the questions addressed there would be different, focusing on how multiple motor units are coordinated to generate the resulting skin patterns, rather than resolving the structure of single motor units. This could be challenging if the patterns of premotor control are highly variable, thus making the detection of meaningful or interpretable motion correlations difficult. This remains to be tested.

      We also acknowledge that electrophysiological validation remains limited. Patch-clamp experiments were performed in Euprymna berryi to test predictions generated by the computational analysis, and these experiments confirmed that activation of single motor neurons can produce anisotropic expansion of chromatophore subregions. We now provide the associated datasets in the Data Availability section. We agree that complementary electrophysiological or anatomical experiments in Sepia officinalis would further strengthen the conclusions. Such experiments represent an important direction for future work.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      General points:

      (1) Given all the experimental conditions and animals tested, the manuscript would be much stronger if the figures represented pooled data from many animals and experiments (e.g. Figure 1C).

      We agree that pooling data from multiple animals would strengthen the manuscript. In practice, we tested these experimental conditions across several animals (14 Euprymna berryi and 12 Sepia officinalis), but we selected the segments shown in the figures for their minimal artifacts and errors. Acquiring high-quality, stable recordings of this type is extremely challenging, and the presented data represents the clearest examples suitable for analysis and visualization. We hope that in the future these methods will enable not only the collection of a larger, high-quality dataset, but also comparisons across individuals, ages, species, and different regions of the mantle.

      (2) It's very unclear what animals were used for each experiment:

      (a) E. berryi: L677 states that 14 animals were filmed, and L684 implies that non-sedated individuals were used in addition to sedated animals, but it appears all the data is from a single E. berryi with sedation?

      The original wording was unclear, so we modified the sentence for clarity. The Methods now specify that 14 animals were filmed to refine the experimental pipeline and explore different conditions, while the data presented in the Results are from a single lightly sedated individual chosen for quality and stability of chromatophore activity.

      (b) S. officinalis: L692 onwards states that lots of different conditions and animals were explored, but only minimal data from a couple of animals is described in the figures. L156 states that all (?) the data comes from one head-fixed animal and one sedated and head-fixed animal. L549: The conclusion states that the pipeline was used in freely moving animals, but it appears that all of the S. officinalis were head-fixed? This is very confusing. Rather than describing the conditions of every experiment ever performed, the manuscript would benefit from explicitly stating the experimental conditions used for each figure.

      The original text was unclear. We have clarified in the manuscript which animals and experimental conditions were used for the analyses in each figure. To clarify, E. berryi was recorded without head fixation, whereas S. officinalis data were obtained under head-fixed conditions. We did film 11 S. officinalis without head fixation, and data can in principle be extracted from these recordings. Head fixation was used both to minimize visual artifacts and to enable longer, stable recordings, which was important for capturing the highest level of apparent noise in motor unit activation, information that is critical for our analyses of motor-unit organization, though not necessary for studies of broader camouflage patterns. Regarding the conclusion (L549), our intended point was that our computational pipeline enables large-scale analyses that would be very difficult or impossible with traditional electrophysiology, not that all data were acquired from freely behaving animals. While fully unconstrained recordings remain technically challenging due to optical and logistical constraints, we maintain that our approach provides a valid framework for analyzing freely behaving animals.

      (c) Additionally, there is a claim that the sedated condition represents the unsedated one (e.g. L151 and L643), but no data is shown to support this. L173 references Figure 6d as evidence, but 6d doesn't exist. Only L210 provides sedation/no sedation statistics for the number of components per motor unit. However, in L643 it says "and motor unit organization remained unchanged". This data needs to be shown to include that statement.

      The reference to the nonexistent Figure 6d was removed. L170 provides statistics for the number of principal components per chromatophore, and L210 provides statistics for the number of components per MU. We do not think a sub-figure is necessary. We agree, however, that “motor unit organization” in L643 is potentially misleading, as we only compared the number of chromatophores belonging to a single MU and not the MU shape or distribution. We changed “organization” to “size (in chromatophores)”.

      (3) The text needs considerable revision. There are many typos (including multiple instances of "refs" instead of the actual references being inserted). These issues make the manuscript much more difficult to evaluate.

      Our apologies. We have now added the missing refs.

      (4) It is not clear how convincing the chromatophore groups are. For instance, Figure 4h could alternatively be interpreted as a group of 5 chromatophores in a motor group that happen to co-vary with a sixth one at a great distance. Without seeing some of the raw data (videos), it's difficult to assess how convincing it is that these chromatophores belong to the same group. I recommend analyzing: when multiple chromatophores expand together, what is the likelihood that other chromatophores also happen to expand at the same time (given the frequency that they're all changing shape spontaneously)?

      We appreciate the reviewer’s concern. Chromatophores are assigned to the same cluster because their activity, or that of their slices, covaries consistently over time. It is, of course, possible that what appears as a single motor unit may reflect two or more motor neurons acting simultaneously during the recording. Longer video segments increase confidence in the integrity of inferred motor units, but in the absence of a ground truth for motor unit spatial organization in this species at this age, it is difficult to quantify the likelihood that two motor units are being conflated. Raw video data is provided in the Data Availability section. We note, however, that most of the time motor units cannot be readily discerned by eye, because individual chromatophores and their constituent slices fluctuate continuously, and motor-unit correlations are subtle and distributed across multiple chromatophores.
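
      The reviewer's question about chance co-activation can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not part of the authors' pipeline. It assumes a distant chromatophore that expands independently in a fixed fraction of time windows (`p_active`, a hypothetical rate) and asks how often it would coincide with at least `k_shared` of `n_events` expansions of a putative motor unit purely by chance, which is a binomial tail probability.

```python
from math import comb


def chance_coincidence(n_events: int, k_shared: int, p_active: float) -> float:
    """P(at least k_shared of n_events coincide by chance), assuming independence.

    n_events : number of expansion events of the putative motor unit
    k_shared : number of those events in which the distant chromatophore also expands
    p_active : hypothetical probability that the distant chromatophore is
               expanding in any given event window
    """
    return sum(
        comb(n_events, k) * p_active**k * (1.0 - p_active) ** (n_events - k)
        for k in range(k_shared, n_events + 1)
    )


# Hypothetical numbers, chosen only for illustration.
p = chance_coincidence(n_events=20, k_shared=15, p_active=0.1)
print(f"chance probability: {p:.2e}")
```

      Real chromatophore activity is autocorrelated and non-stationary, so the independence assumption likely understates chance overlap; a shuffle-based null distribution built from the recordings themselves would be a more conservative check.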

      (5) The rationale for focusing on spontaneous activity is introduced relatively late in the manuscript and would benefit from being stated earlier. Examples should be provided of what this looks like (as opposed to regular chromatophore expansion). It would be valuable to see measurements across many experiments of how expanded the chromatophores are - what is the change in surface area? And what is the frequency of expansion for each chromatophore?

      Thank you for the remark. This is true. We have added a paragraph at the beginning of the Results section to clarify the rationale for focusing on spontaneous activity.

      This section now reads:

      “Because our primary aim was to describe the composition and coordination of chromatophore motor units, it was important to examine animals in the absence of the descending commands that occur during active behavior. Spontaneous activity, typically mild and “noisy”, was thus ideal to enable measurements of the motion correlations between chromatophores that reflected shared motor neuron drive, rather than shared correlations due to upstream motor neuron groupings by premotor circuits.”

      We added an example of video recording of spontaneous activity in our Data Availability section.

      While quantifying expansion magnitude and frequency across experiments would indeed be valuable, these questions fall outside the primary focus of the present study, which centers on resolving motor unit organization. In the section “Dynamics of chromatophore expansion and contraction,” we analyze the speed of expansion and contraction to demonstrate that such kinetic features can be reliably detected with the temporal resolution of our video imaging approach. By isolating single muscle activations, we establish a methodological framework that can be used in future work to quantify expansion amplitude, rate of change and frequency across preparations.

      (6) Chromatophore expansion was only measured in anesthetized E. berryi, and L679 states that chromatophore expansion was triggered by shining light on the skin. However, light-mediated chromatophore expansion may be mediated by a different mechanism, so chromatophore correlations do not necessarily reflect the underlying motor control.

      We agree that there is, in principle, a theoretical risk of direct light-mediated activation of chromatophores. Yet the kinetics of this light-mediated activation are very different, and are the object of a separate, ongoing investigation by our groups. In our experiments, the illumination was applied to the whole animal rather than locally to the skin, ensuring that all chromatophores and the eyes were exposed to the same light source. By transitioning from darkness to light, we created a window in which chromatophores were partially expanded; both fully contracted and fully expanded states would show little to no decorrelation. Within this window, we observed spontaneous fluctuations in chromatophore activity, which formed the basis for our correlation analyses. To our knowledge, direct light-mediated expansion of chromatophores has not been reported in E. berryi, although it may exist there. Finally, the size, shape, and orientation of the inferred motor units align with electrophysiological evidence, supporting the validity of our motor unit inferences.

      (7) Some figures might be better suited for the supplement. For instance, it's not clear what the significance of Figure 5 is (it's not currently sufficiently justified in the text).

      We have clarified the purpose of Fig. 5 in both the Results and Discussion sections. In the Results, we now explain that events are separated by amplitude to show that expansion–contraction kinetics can be reliably measured across a full range of chromatophore events, validating the precision of our videographic approach. In the Discussion, we highlight that this precision allows measurement of radial muscle speeds and opens avenues to study chromatophore biomechanics, including the contributions of intertwined forces such as radial muscles, elastic pigment sacs, and intercellular coupling.

      (8) Multiple chromatophores can belong to multiple clusters - this study reveals that this is because subsections of a chromatophore are controlled separately. But do the same sections (slices) of chromatophores ever belong to multiple clusters?

      Yes, it is possible. Dubas (1985) used videographic recordings to show that the same chromatophore muscle fibers could be activated by stimulation of different nerve bundles, supporting Florey’s (1969) electrophysiological evidence for polyneuronal excitatory innervation. From Dubas: "Usually, different muscle fibres were recruited by each nerve but sometimes a single muscle fibre responded to stimulation of each nerve. Variations of the stimulus voltage also produced gradation of the amplitude of shortening of individual muscle fibres. This supports the evidence above for multiple innervation of single muscle fibres."

      The petal-like distribution of motor-neuron influence shows overlapping territories, suggesting that some chromatophore sections may be influenced by multiple neurons. However, this overlap could arise from polyinnervation of individual muscles, the presence of gap junctions between muscles, or passive mechanical coupling due to the elastic properties of the pigment sac.

      With the present approach, it is not possible to disentangle the relative contributions of these mechanisms, which will require targeted physiological or anatomical experiments. For this reason, we adopted a hard clustering approach for individual chromatophore slices.

      (9) All time should be labeled in seconds, not in frames, and all distances should be measured in um or mm, not in pixels.

      We chose to present figures in pixels and frames to reflect the native units of our recordings and analyses, which preserves fidelity and reproducibility of the computational pipeline. For biological interpretation, corresponding values are converted to µm in the main text, providing the relevant real-world scale. A scale for conversion is provided in the figure legend.

      Specific comments:

      (1) L36: I'm not sure the description of virtual chromatophores here is clear enough to make sense to a more general audience.

      Addressed. We retained the concept of ‘virtual chromatophores’ in the abstract and added a brief clarifying phrase to indicate that these are functional groupings of adjacent chromatophore territories that act as single units.

      (2) L50: "Rimmed by" - consider rephrasing.

      Addressed. Replaced with “surrounded”.

      (3) L64: "refs" - actual references aren't inserted. There are multiple other examples of this.

      Addressed. Added missing references.

      (4) L100: This section could use rewriting. Some of the text reads more like a figure legend.

      Addressed. We have streamlined the main text to reduce redundancy with the figure legend.

      (5) L101: Consider the opening sentence/s providing a more general introduction to the question and approach.

      Addressed.

      (6) L104: This implies that the data presented are from 14 animals of many ages. This is only relevant if the pooled data is analyzed and presented.

      We agree that the original phrasing was ambiguous. We have modified the sentence for clarity, and explain in the Methods that 14 animals were filmed to refine the pipeline and explore experimental conditions, while the analyses shown are from a single animal.

      (7) L111: HDBSCAN should be defined.

      Addressed. The acronym has been expanded.

      (8) L173: Figure 6D doesn't exist.

      Addressed. The reference to the nonexistent Figure 6d was removed.

      (9) L193: "excluding negative (contraction) phases" This phrase requires clarification.

      Addressed. Added “see Methods” in the legend and added clarification on the reasoning in Methods.

      (10) L204: Should explain why the switch to affinity-propagation clustering was made when a different method was used for E. berryi.

      Addressed in discussion.

      (11) Figure 3: I recommend including a diagram or image of a whole cuttlefish and showing what the corresponding imaging area was in relation to the animal so the reader gets an intuitive sense of scale.

      Thank you. We have added a supplementary figure to give the reader a sense of scale.

      (12) L221/Fig 3b: These colors are supposed to represent clusters of 3 to 5 chromatophores? The clusters look much bigger.

      The figure shows clusters of 3 to 5 chromatophores, but many adjacent clusters were assigned the same color. We have changed the colors to remove this ambiguity.

      (13) Figure 3c: This would be more powerful if it represented the combined data of many experiments to draw a general conclusion. Also, shouldn't these cluster sizes match those in 2e, e.g. they get as big as 40?

      We assume the reviewer is referring to a comparison between Figures 3c and 2e. For visualization purposes, the graph in 3c was truncated to display over 90% of the data, which explains why the largest clusters appear smaller than in 2e. We modified the legend accordingly. We agree that the results would be strengthened by pooling data from additional experiments; however, acquiring high-quality, artifact-free recordings suitable for motor unit analysis is extremely challenging. We hope that our framework will enable future studies to extend this analysis.

      (14) Figure 4: I would show some of these examples earlier, to give the reader an intuitive sense of the data and claims (though it doesn't need its own figure - provide a couple of examples, and the diagram of how much of the mantle you're sampling) then put the rest in the supplement, and include some videos too.

      We agree that providing spatial context is important for readers to develop an intuitive understanding of the dataset. However, introducing examples of motor units earlier in the manuscript would, in our view, interrupt the logical progression of the Results, where motor unit identification builds on prior analyses. To address the reviewer’s concern, we have added a new supplementary figure (Fig. S1) illustrating the size and location of the sampled mantle region. In addition, we now provide representative videos in the Data Availability section to give readers direct visual access to the underlying dynamics.

      (15) Figure 4f: Is the location of the split color in each dot accurate? It's surprising that each one is split down the middle, and the pink side is always on the right - this is unintuitive given where the motor neuron is likely to be located.

      The dots and half dots represent the membership of a chromatophore to a particular cluster.

      (16) Figure 5: I didn't find this figure sufficiently justified in the text. I would move this to the supplement.

      Addressed in General point #7.

      (17) L350: States that 12 animals were patched, but the data isn't shown. It's important to show all of this data (some of which can be in the supplement).

      Addressed. We provided the data in the Data Availability Section.

      (18) Figure 5: I would quantify how many chromatophores were in each motor group across all the recording sessions, and compare this to the equivalent behavioral analysis.

      We assume the reviewer means Fig. 6. We calculated and stated the size of motor units across patching sessions.

      (19) Figure 5c: I recommend labeling each panel with a different number so you can refer to specific data.

      We assume the reviewer means Fig. 6c. We consider the figure layout clear enough to allow readers to follow the data without additional panel numbers.

      (20) L379: Typo: repeat of "quantitative"

      Addressed.

      (21) L576: Salinity should be 33-36 ppt, not %

      Addressed.

      (22) L877: The salinity units are sg? That should be stated. Though I would use the same units for salinity throughout.

      Addressed.

      Overall, this work introduces a potentially valuable quantitative framework for studying chromatophore dynamics. Addressing the points above would substantially strengthen the manuscript and clarify the scope and support for its conclusions.

      We thank the reviewer for these many helpful comments.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 64 - missing references for chromatophore colour with age.

      Addressed. Added missing refs.

      (2) Line 64-65 - would be good to have a little more detail about what is meant by 'migrating through the skin'. Is this a lateral process, or depth in the skin?

      Addressed. Replaced “migrating in the thickness..” with “through the thickness..” to emphasize verticality.

      (3) Line 72 - typo, should read '...individual and groups...'

      Addressed.

      (4) Remove 'In Fig 1, ...' from line 104.

      Addressed.

      (5) Figure 1 - It's unclear why some chromatophores are uncoloured with a red dot in the centre. Are these chromatophores that do not share a cluster with neighbours? If so, wouldn't it make more sense to colour the chromatophore with a unique colour of its own? Or, at the very least, make a note in the caption to indicate that all white chromatophores are not clustered with neighbours.

      Segmented chromatophores are shown in white, with coloured slices highlighting cluster membership. Uncoloured slices represent outliers. Addressed in the figure legend.

      (6) Line 119 - the concept of a 'closed virtual chromatophore' needs a few more words of explanation. The way I interpret the text as it is, is that the motor units driving colour change are not necessarily the individual chromatophores, but a motor region containing a mixture of whole and partial chromatophores innervated by the same motor neuron. If this is the case, a few extra words of description would help here to remove any ambiguity as I think this is an important concept for the paper.

      Addressed. We added a sentence clarifying the concept.

      (7) Line 173 - Figure 6d doesn't exist in the paper. Was a different panel intended? If so, please make sure to number the figures in order of appearance in the manuscript.

      The reference to the nonexistent Figure 6d was removed.

      (8) Figure 3b is very difficult to see. Perhaps consider lightening the background image. Please also indicate whether the individual colours refer to individual clusters. If this is the case, then some of these clusters look much larger than the 3-5 suggested in the caption.

      This issue has been corrected.

      (9) Line 210 - remove the bold type.

      Addressed.

      (10) Line 211 - please specify which 'two groups' you are referring to here. Presumably, this is anaesthetised and non-anaesthetised.

      Addressed.

      (11) I think that the text is missing any indication of the pixel sizes involved in extracting slice metrics, particularly from the S. officinalis data. It would be great to include some data on how many pixels span the radius of an expanded chromatophore. There is some small indication of this in Figure 2a, but a panel or two with details about the pixel size of S. officinalis chromatophores and their slices would be welcome. This would help with the judgment of the robustness of the resolution of the analysis. Looking at the y-axis in Figure 5a, there is some indication that the chromatophore radius is only 1 to 8 pixels. Is this the case?

      Figure 5a does not show chromatophore radius but rather the relative change in peak amplitude during an expansion event. At the peak of an event, the chromatophore radius is likely larger than the plotted value, since it is the sum of the baseline radius and the peak amplitude.

      (12) Line 246-7 - reword this sentence to avoid referring to Figure 3d in the narrative. Include it in parentheses instead.

      Addressed.

      (13) Lines 408 and 409 - missing references.

      Addressed.

      (14) Line 576 - salinity should be reported in parts per thousand, not per cent.

      Addressed.

      (15) Line 593 - how were animals <50mm fed?

      Animals smaller than 50 mm were fed Neomysis spp. or small Palaemonetes spp., as noted a few lines above the description for animals larger than 50 mm.

      (16) Line 847 - typo - '...putative motor units' ramifications...'

      Addressed.

      (17) Line 854 - better to write out the [chrom_id, label] info as narrative text rather than using the variable names.

      Addressed.

      (18) Line 876 - two typos '...were reared in an artificial...'

      Addressed.

      (19) Line 877 - please use the same salinity metric as used in the earlier part of the methods.

      Addressed.

      (20) Section 898-910 - equipment details would ideally include the location of the company. E.g. (BX51W1, Olympus, Tokyo, Japan).

      Addressed.

      Reviewer #3 (Recommendations for the authors):

      I am left with a number of questions that arise from the authors' work, some of which the authors themselves briefly mention in the technical limitations section.

      (1) In relation to the first weakness, do the authors know if the recruitment patterns they identify are likely to be the same when octopi perform visually-mediated camouflage to their environment?

      Thank you for this comment. We assume the reviewer is referring to S. officinalis. There seems to be a misunderstanding: our approach is designed to reveal the smallest independent functional units—motor units—that together generate skin patterns. The technique is fully applicable to an animal displaying camouflage, but the results would necessarily differ. Camouflage patterns are composed of relatively large shapes compared to individual motor units and arise from the coordinated activation of multiple units. Disentangling motor units requires decorrelated activity, whereas visually-evoked camouflage inherently drives correlated motor-unit activation by premotor control. To use an analogy, if our goal were to map the distribution and wiring of pixels on a screen, it would be more informative to broadcast a noise signal rather than display coherent images, as the noise produces decorrelated activity that allows the underlying structure to be resolved. We have clarified this important point in the early results section.
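
      The screen-and-noise analogy can be illustrated with a toy simulation. The sketch below is not the authors' analysis; it assumes two hypothetical motor units (A drives chromatophores `a1` and `a2`, B drives `b1`), Gaussian drive signals, and an arbitrary measurement-noise level of 0.3. With independent ("noisy") drive, only chromatophores sharing a motor unit are correlated; with a shared premotor command, chromatophores from different units become just as correlated, and the unit boundaries can no longer be recovered from correlations.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def simulate(shared_command: bool, n_frames: int = 2000, seed: int = 0):
    """Return (within_unit_r, across_unit_r) for two toy motor units.

    shared_command=False: units A and B receive independent drive (spontaneous noise).
    shared_command=True : both units follow one premotor command (coherent pattern).
    """
    rng = random.Random(seed)
    a1, a2, b1 = [], [], []
    for _ in range(n_frames):
        drive_a = rng.gauss(0.0, 1.0)
        drive_b = drive_a if shared_command else rng.gauss(0.0, 1.0)
        # Each chromatophore follows its unit's drive plus private noise.
        a1.append(drive_a + 0.3 * rng.gauss(0.0, 1.0))
        a2.append(drive_a + 0.3 * rng.gauss(0.0, 1.0))
        b1.append(drive_b + 0.3 * rng.gauss(0.0, 1.0))
    return pearson(a1, a2), pearson(a1, b1)


for label, shared in (("spontaneous", False), ("coherent", True)):
    within, across = simulate(shared)
    print(f"{label}: within-unit r = {within:.2f}, across-unit r = {across:.2f}")
```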

      (2) The authors provide indirect evidence that motor neurons innervate multiple chromatophores. Can sets of radial muscles within a chromatophore be innervated by multiple motor neurons? Is there neuroanatomical evidence or experiments that could perhaps shed light on this?

      Addressed above. Same question as #1(8).

      (3) Are multi-innervated chromatophores evenly distributed across the octopus's body? For instance, could the authors compare chromatophore recruitment over multiple patches on the animal from multiple regions?

      At present, we do not have sufficient data to quantitatively compare motor-unit structure or the distribution of multi-innervated chromatophores across different body regions of cuttlefish. However, we would not necessarily expect uniformity across the skin, as distinct body regions are associated with characteristic pattern elements (e.g., the white square on the central mantle or the thicker zebra stripes along the sides). It is therefore plausible that different motor-unit geometries and densities are differentially represented across regions to support these region-specific patterns. Future recordings spanning multiple patches and body locations will be required to test this question directly.

      (4) Relatedly, is there any idea of whether chromatophore size or age corresponds with the number of motor units within a single chromatophore?

      At present, our analyses are limited to single developmental time points, and we therefore cannot directly assess whether chromatophore size or age correlates with the number of motor neurons innervating an individual chromatophore. However, this is a question that our analysis framework is explicitly designed to address. Our custom pipeline, CHROMAS (Ukrow, Renard et al., 2025), includes tools for longitudinal image alignment that allow chromatophores to be tracked within the same animal across development. Applying these scripts to developmental datasets will enable future analyses linking chromatophore growth or age to changes in the motor innervation of single chromatophores.

      I understand that a full resolution to the issues raised above may require substantial additional experiments. At a minimum, further discussion of these points with integration of existing literature would elevate the paper.

    1. eLife Assessment

      The paper presents a valuable finding that the human brain and models that incorporate sentence structures can capture sentence-level semantics beyond word meaning, while large language models behave differently. The evidence supporting the authors' claims is solid, though the stimuli are highly controlled and some analyses could be more thorough. This work will be of interest to researchers in language neuroscience and those developing language models.

    2. Reviewer #1 (Public review):

      Summary:

      This paper investigates whether transformer-based models can represent sentence-level semantics in a human-like way. The authors designed a set of 108 sentences specifically to dissociate lexical semantics from sentence-level information and collected 7T fMRI data from 30 participants reading these sentences. They conducted representational similarity analysis (RSA) comparing brain data, model representations, and human behavioral ratings. They found that transformer-based models match brain representations better than a static word-embedding baseline, which ignores word order, but fall short of models that encode the structural relations between words. The main contributions of this paper are:

      (1) The construction of a sentence set that disentangles sentence structure from word meaning.

      (2) A comprehensive comparison of neural sentence representations (via fMRI), human behavior, and multiple computational models at the sentence level.
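
      For readers unfamiliar with RSA, the comparison described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes that dissimilarity between two sentences is 1 minus the Pearson correlation of their response vectors, and that brain and model dissimilarity matrices are compared with a Spearman rank correlation over the upper triangle. The function names and the tiny four-sentence input are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def rdm_upper(vectors):
    """Upper triangle of the dissimilarity matrix (1 - Pearson r) between items."""
    n = len(vectors)
    return [
        1.0 - pearson(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def rsa_score(brain, model):
    """Spearman correlation between the two dissimilarity structures."""
    return pearson(ranks(rdm_upper(brain)), ranks(rdm_upper(model)))


# Hypothetical responses: 4 sentences x 4 features (voxels or embedding dimensions).
brain = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1], [1, 3, 2, 5]]
model = [[2 * v + 1 for v in row] for row in brain]  # same geometry, rescaled
print(rsa_score(brain, model))
```

      Because RSA compares only the pairwise dissimilarity structure, the brain and model representations may have different dimensionalities, which is what allows fMRI patterns, embeddings, and graph-based models to be compared in one framework.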

      Strengths:

      (1) The paper evaluates a wide variety of models, including layer-wise analysis for transformers and region-wise analysis in the human brain.

      (2) The stimulus design allows precise dissociation between lexical and sentence-level semantics. The RSA-based approach is empirically sound and intuitive.

      (3) The constructed sentences, along with the fMRI and behavioral data, represent a valuable resource for studying sentence representation.

      Weaknesses:

      (1) The rationale behind averaging sentence embeddings across multiple transformer models (with different architectures and training objectives) is unclear. These transformer-based models have different training paradigms and model architectures, which may result in misaligned semantic spaces. The averaging operation may dilute the distinct sentence representations learned by each model, potentially weakening the overall semantic encoding for sentences. Please clarify this choice or cite supporting methodology.

      (2) All structure-sensitive models discussed incorporate semantics to some extent. Including a purely syntactic baseline, such as a model based on context-free grammar, would help confirm the importance of syntactic structures.

      (3) In Figure 2, human behavioral judgments show weak correlations with neural data, and even fall below those of computational models, suggesting the behavioral judgments may not reflect the sentence structures in a brain-like way. This discrepancy between behavioral and neural data should be clarified, as it affects the interpretation of the results.

      (4) To better contextualize model and neural performance, sentence similarity should be anchored to a notion of semantic "ground truth", such as the matrix shown in Figure 1a. Comparing this reference with human judgments, brain responses, and model similarities would help establish an upper bound.

      (5) The structure of this paper is confusing. For instance, Figure 5 is cited early but appears much later. Reordering sections and figures would enhance readability.

      (6) While the analysis is broad and comprehensive, it lacks depth in some respects. For instance, it remains unclear what specific insights are gained from comparing across brain regions (e.g., whole brain, language network, and other subregions). Similarly, the results of simple-average and group-average RSA appear quite similar and may not advance the interpretation.

      (7) While explaining the grid-like pattern due to sentence length is important, this part feels somewhat disconnected from the central question of this paper (word order). It might be better placed in supplementary material.

      Comments on revised version:

      The new version of the paper has addressed my main concerns, including:

      (1) clarification about the methodology of Transformer embeddings

      (2) discussion about the purely syntactic models

      (3) discussion about the low correlation between behavioural ratings and brain activations

      (4) better structure of the paper

      (5) clarification about pre-registration

      I believe the paper has been substantially improved after revision.

    3. Reviewer #3 (Public review):

      Summary:

      Large Language Models have revolutionized Artificial Intelligence and can now match or surpass human language abilities on many tasks. This has fuelled interest in cognitive neuroscience in exposing representational similarities between Language Models and brain recordings of language comprehension. The current study breaks from this mold by: (1) Systematically identifying sentence structures for which brain and Large Language Model representations diverge. (2) Accounting for such sentence structures using a model structured by semantic roles. As such, the study may now fuel interest in characterizing how Large Language Models and brain representations differ, which may prompt new, more brain-like language models.

      Strengths:

      * This study presents a bold challenge to a literature trend that has touted similarities between Transformer models and human cognition based on representational correlations with brain activity. This challenge is substantiated by identifying sentences for which brain and model representations of sentences diverge.

      * This study conducts a rigorous pre-registered analysis of a comprehensive selection of state-of-the-art Large Language Models, on a controlled sentence comprehension fMRI dataset. The analysis is conducted within a Representational Similarity Analysis framework to support similarity comparisons between graph structures and brain activity without needing to vectorize graphs. Transformer models are predicted and shown to diverge from brain representations on subsets of sentences with similar word-level content but different sentence structures.

      * The study introduces a 7T fMRI sentence comprehension dataset and accompanying human sentence similarity ratings which may be a fruitful resource for developing more human-like language models. Unlike other model-based sentence datasets, the relation between grammatical structure and word-level content is controlled, and subsets of sentences for which models and brains diverge are identified.

      Weaknesses:

      * The interpretation of findings is nuanced. Although Transformers underperform as brain models on the critical subsets of controlled sentences, a Transformer outperforms all other models when evaluated on the union of all sentences when both word-level content and structure vary. Transformers also yield equivalent or better models of human behavioral data. Thus, although Transformers have demonstrable flaws as human models which are pinpointed here, in the general case (some) Transformers are more human-like than the other models considered.

      * There may be confounds between the critical sentence structure manipulations and visual processing. This is inconvenient because activation in brain regions that process semantics tends to partially correlate with low-level representations of sentence surface features encoded in visual cortex. Although the study commendably controls for confounds associated with sentence length, correlations with the key sentence structure models are most salient in visual cortex and diminish in other brain networks when V1-V4 activation is controlled for.

      * Sentence similarity computations are emphasized as the basis for unifying comparative analyses of graph structures and vector data. A strength of this approach is that correlation is not always the ideal similarity metric. However, a weakness is that similarity computations are not unified across models. This has practical consequences because different similarity metrics applied to the same model produce positive or negative correlations with brain data and repeating analyses with a different representational dissimilarity measure seems to produce some anomalous results.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The rationale behind averaging sentence embeddings across multiple transformer models (with different architectures and training objectives) is unclear. These transformer-based models have different training paradigms and model architectures, which may result in misaligned semantic spaces. The averaging operation may dilute the distinct sentence representations learned by each model, potentially weakening the overall semantic encoding for sentences. Please clarify this choice or cite supporting methodology.

      The reviewer questions the rationale for averaging sentence embeddings across different models. However, our method involves computing correlations separately for each model, then averaging the correlations. We apologize for the confusion. We have clarified this on page 3:

      “Results for the ‘Transformers’ model are computed by computing correlations separately for five different transformer models and then taking a simple average of these correlations. Results for each individual transformer are presented in Supplementary Information Figure S2.”
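      The distinction drawn in this response, averaging per-model correlations rather than averaging embeddings, can be illustrated with a minimal sketch. The function names, the toy data, and the use of Pearson correlation over the upper triangle are assumptions made for illustration; the manuscript's exact correlation measure and controls are not specified here.

```python
import numpy as np

def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries of a square matrix."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def rsa_correlation(model_sim, brain_sim):
    """Pearson correlation between the pairwise entries of two similarity matrices.
    (Assumption: the study may use a different correlation measure.)"""
    return float(np.corrcoef(upper_triangle(model_sim), upper_triangle(brain_sim))[0, 1])

def mean_transformer_correlation(model_sims, brain_sim):
    """Correlate each model with the brain separately, then average the correlations."""
    per_model = [rsa_correlation(sim, brain_sim) for sim in model_sims]
    return float(np.mean(per_model)), per_model

def random_similarity(rng, n_sentences, dim):
    """Hypothetical stand-in for a sentence-by-sentence similarity matrix."""
    embeddings = rng.normal(size=(n_sentences, dim))
    return np.corrcoef(embeddings)

rng = np.random.default_rng(0)
n_sentences = 12
brain_sim = random_similarity(rng, n_sentences, 50)
# Five toy "transformers" with different embedding sizes: their embeddings
# could not be averaged directly, but their brain correlations can.
model_sims = [random_similarity(rng, n_sentences, d) for d in (16, 32, 64, 128, 256)]
mean_r, per_model_r = mean_transformer_correlation(model_sims, brain_sim)
```

      Because each model is compared with the brain data in its own similarity space, no alignment between the models' embedding spaces is needed.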

      (2) All structure-sensitive models discussed incorporate semantics to some extent. Including a purely syntactic baseline, such as a model based on context-free grammar, would help confirm the importance of syntactic structures.

      Following the suggestion, we have implemented two syntactic models and discuss the results on page 10:

      “We also found that purely syntactic models based on constituency parses (see Benepar and CFG) show poor correlations with brain activity (see Supplementary Information Figure S2). Examining the corresponding RSA matrices (see Figure S1), this seems to be due to such models being overly sensitive to syntactic form, and relatively insensitive to which words are assigned to different nodes within the syntactic tree. This is most evident for the edit-distance similarity metric, and to a lesser extent also for the subtree similarity metric. This finding highlights the value of hybrid approaches designed to appropriately balance sensitivity to lexical, syntactic, and compositional information in representing semantic information at the sentence level.”

      (3) In Figure 2, human behavioral judgments show weak correlations with neural data, and even fall below those of computational models, suggesting the behavioral judgments may not reflect the sentence structures in a brain-like way. This discrepancy between behavioral and neural data should be clarified, as it affects the interpretation of the results.

      While the behavioural judgements are made by different participants and involve a different task than the neuroimaging results, nonetheless we agree the difference is surprising and warrants more detailed consideration. We have included a more detailed discussion of this issue on page 11:

      “Our study has several limitations. First, we found a surprisingly low correlation between behavioural ratings and brain activations (see Figure 2). This may be partly explained by differences in task structure. In the behavioural experiment, participants viewed many pairs of related sentences, and were explicitly asked to pay attention to differences in the words of each sentence. In contrast, in the fMRI task, participants read one sentence at a time without an explicit comparison. In addition, we suspect that presentation of so many sentence pairs with highly similar structures may have biased the way in which participants rated sentence similarity. Modifications to the behavioural task to mitigate these aspects may reduce the divergence between behavioural and brain findings.”

      (4) To better contextualize model and neural performance, sentence similarity should be anchored to a notion of semantic "ground truth", such as the matrix shown in Figure 1a. Comparing this reference with human judgments, brain responses, and model similarities would help establish an upper bound.

      While our design matrix served as the basis for constructing a set of stimuli with systematic modifications, we respectfully suggest that it should not be regarded as a ‘semantic ground truth’. Sentence pairs within each category will not have the same degrees of semantic similarity since the words and context differ across sentences in a graded manner. Furthermore, while we anticipated ‘different’ sentence pairs would be less similar than ‘swapped’ sentence pairs, and that within each of the six block diagonals the ‘modified’ or ‘substituted’ sentence pairs would be the most similar, we did not have any prediction about the magnitude of these differences. Our goal was to construct a set of sentence pairs which spanned a range of semantic similarities, and allowed for dissociation between lexical similarity and overall similarity in meaning. The design matrix is not intended to represent a ‘ground truth’ that human judgements or brain representations would be expected to conform with.

      (5) The structure of this paper is confusing. For instance, Figure 5 is cited early but appears much later. Reordering sections and figures would enhance readability.

      We agree that placement of figures was not ideal in the previous draft. We have reworked the manuscript so that all figures appear closer to their mention in the text, and the figure in question (now Figure 3) appears in the correct order. We have also substantially revised the discussion and added subheadings to guide the reader through the issues it covers.

      (6) While the analysis is broad and comprehensive, it lacks depth in some respects. For instance, it remains unclear what specific insights are gained from comparing across brain regions (e.g., whole brain, language network, and other subregions). Similarly, the results of simple-average and group-average RSA appear quite similar and may not advance the interpretation.

      We included both analyses in line with our preregistration, and also because we believe the fact that two distinct approaches to analyzing the data yield similar results strengthens our conclusions.

      (7) While explaining the grid-like pattern due to sentence length is important, this part feels somewhat disconnected from the central question of this paper (word order). It might be better placed in supplementary material.

      We believe that the grid-like pattern in the RSA results is an important unexpected finding that warrants discussion in the main manuscript.

      Reviewer #1 (Recommendations for the authors):

      (1) Consider including a purely syntactic baseline model. For instance, parse each sentence into a constituency tree and compute tree edit distances between pairs of trees. This would allow you to construct a sentence similarity matrix based solely on syntactic structure, and may clarify the role of syntax in sentence representations.

      See our response to Public Review comment 2.

      (2) Instead of averaging embeddings across different transformer-based models, I recommend reporting RSA results for each model individually. For instance, compare one sentence-level model (e.g., SentBERT or SimCSE) and one general-purpose language model (e.g., GPT-2 or Llama).

      See our response to Public Review comment 1.

      (3) I suggest revisiting the structure of the Results section to improve the clarity and impact of your key findings. Consider which results are most central to the paper's claims and ensure they are presented in the main text. Less central analyses (e.g., the analysis on the grid-like pattern) might be better suited for the supplementary information. Presenting behavioral results prior to neuroimaging results could also improve logical flow by first validating model similarity estimates behaviorally.

      As mentioned in our response to Public Review comment 5, we have revised the ordering of the figures to improve the flow of the main manuscript. We believe that the grid-like pattern in the RSA results is an important unexpected finding that warrants discussion in the main manuscript. In addition, we believe that presenting the neuroimaging results first is appropriate as this is the primary and most important contribution of our study.

      Reviewer #2 (Public review):

      (1) The stimuli are not fully controlled for lexical content across conditions. Residual lexical differences between sentences could still influence both brain and model similarity patterns. To more cleanly isolate syntactic effects, it would be useful to systematically vary only a single structural element while keeping all other lexical content constant (e.g., the boy kicked the ball / the ball kicked the boy). It would be better to engage more with the minimal pair paradigm, which is widely used in large language model probing research.

      The reviewer rightly argues that our stimuli do not fully control for lexical content across conditions, and that a more appropriate paradigm may be to utilise minimal pairs in which only a single variable of interest (such as sentence structure) is modified. We agree that most of our sentence pairs do not constitute minimal pairs; however, this was not our objective. Our study design aimed to synthesise traditional minimal pair approaches with more recent research paradigms using naturalistic stimuli. As such, we selected stimuli which are more complex and contain more variable features than traditional minimal pair studies, but which also are tailored to highlight differences which are of particular theoretical interest.

      Because we are interested in comparing the effects of multiple sentence elements and semantic roles, a systematic pairwise comparison of minimal pairs is not necessarily optimal. Instead, we designed our stimuli to leverage the advantage of fMRI in that we can measure the brain representations corresponding to each sentence, and hence can conduct a full series of pairwise comparisons of sentence representations. We do not claim this approach to be universally superior to a minimal pair approach, but we do believe our novel approach provides additional insights and a new perspective on semantic representation relative to minimal pair studies.

      We have added the following paragraph on pages 9-10 contrasting our approach to previous minimal-pair studies:

      “Another approach that has seen widespread use is the presentation of minimal sentence pairs that differ only in one specified aspect, for example, interchanging subject and object in a sentence (Frankland 2015, Wang 2016, Frankland 2020, Giglio 2024), or altering adjective-noun phrases to influence composition (Graves 2010, Schell 2017, Fyshe 2019, Ciapparelli 2025). Our approach is an extension of these approaches utilising more naturalistic and complex sentences, designed to facilitate comparison of a wider range of structural manipulations (see Table 1). In more completely characterising the representational structure of various computational models in response to different structural contrasts, we can more comprehensively evaluate their adequacy as models of semantic processing in the brain.”

      (2) The comparisons are done across fundamentally different model types, including static embeddings, graph-based parsers, and transformers. The inherent differences in dimensionality and training objectives might make the conclusion drawn from RSA inconclusive. Transformer embeddings typically occupy much higher-dimensional, anisotropic representational spaces, and their similarity structure may reflect richer, more heterogeneous information than models explicitly encoding semantic roles. A lower RSA correlation in this study does not necessarily imply that transformers fail to encode syntactic information; rather, they may represent additional aspects of meaning or context that diverge from the narrow structural contrasts probed here.

      The reviewer notes that low RSA correlations do not necessarily imply that transformers fail to encode syntactic information. We acknowledge this in our discussion (page 10), where we also highlight that our focus is not on whether transformers encode such information, but rather on what transformer representations can tell us about how sentence structure is represented in the brain. Our results indicate that transformer embeddings do not have the same geometric properties as brain representations of sentence meaning, at least for certain types of sentences where lexical information is insufficient to determine overall meaning.

      The reviewer also notes that transformer embeddings are highly anisotropic; however, we adjust for this by normalising each feature as discussed on page 14. Finally, the reviewer notes that the transformers we examine differ in architecture and training objectives. This is not critical for our study because we are not seeking to determine which architecture or training objectives are best. Our goal is simply to compare a range of approaches and see which, if any, have similar sentence representations to those formed by the brain. In fact, our results indicate that architecture and training regime make relatively little difference for our stimuli, as shown by the pattern of results for all models in Figure S2.
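      One common way to adjust for anisotropy before computing cosine similarity is to standardise each embedding dimension across sentences. The sketch below assumes per-feature z-scoring; this is an illustrative choice, since the exact normalisation described on page 14 of the manuscript is not reproduced here, and all names and data are hypothetical.

```python
import numpy as np

def zscore_features(embeddings, eps=1e-12):
    """Standardise each feature (column) to zero mean and unit variance across sentences."""
    mean = embeddings.mean(axis=0, keepdims=True)
    std = embeddings.std(axis=0, keepdims=True)
    return (embeddings - mean) / (std + eps)

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

rng = np.random.default_rng(1)
n_sentences, dim = 20, 64
# Toy anisotropic embeddings: a large shared offset dominates every vector,
# so raw cosine similarities are all close to 1.
raw = rng.normal(size=(n_sentences, dim)) + 10.0
raw_sim = cosine_similarity_matrix(raw)
norm_sim = cosine_similarity_matrix(zscore_features(raw))
off_diagonal = ~np.eye(n_sentences, dtype=bool)
```

      In this toy case the shared offset inflates every raw cosine similarity, whereas after standardisation the off-diagonal similarities are centred near zero and differences between sentences become visible.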

      (3) The interpretation of the RSA correlation largely depends on the understanding of models. The authors suggest that because hybrid models correlate better than transformers, this implies that transformers are inferior at representing syntax. However, this is not a direct test of syntactic ability. Transformers may encode syntactic information, but it may not be expressed in a way that aligns with the RSA paradigm or the chosen stimuli. RSA does not reveal what the model encodes, and the models might achieve a good correlation for non-syntactic reasons (e.g., length of sentence, orthographic similarity, lexical features).

      The reviewer argues that RSA correlations do not measure the extent to which a model encodes syntactic information. This is very similar to the previous point. We do not claim that our results show that transformers do not encode syntactic information. Rather, our claim is that sentence embeddings derived from transformers have different geometric properties to brain representations, and that brain representations are better described by models explicitly representing key semantic roles. From this we conclude that, at least for the sentences we present, the brain is highly sensitive to semantic roles in a way that transformer representations are not (at least to the same extent). We have clarified this in a modified paragraph on page 11:

      “We emphasise that our results do not show that transformers fail to represent syntactic or semantic role information. Indeed, large language models show clear capabilities of correctly interpreting sentence structure (Chang 2024), and probing studies have found that transformers represent information about syntax and word order (Clark 2019, Manning 2020). This is consistent with our finding that directly prompting GPT-4 to rate sentence similarity yields very high correlations with human judgements (see Supplementary Information Figure S3). Nonetheless, the fact that transformers can encode and utilise structural information to perform linguistic tasks does not mean that they effectively utilise this information to construct a brain-like representation of sentence meaning.”

      We also respectfully disagree with the reviewer’s suggestions that sentence length and orthographic or lexical similarities may drive model correlations with brain activity. As we discuss on page 19, we explicitly control for differences in sentence length when computing correlations. Our process for constructing our sentence set also controls for lexical similarity by generating pairs of sentences with all or mostly the same words but different orderings. We did not explicitly address orthographic similarity, but this will be strongly correlated with lexical similarity.

      Reviewer #2 (Recommendations for the authors):

      (1) Model dimensionality: the interpretability of cosine similarity diminishes as the dimensionality increases, and there are some math tricks to work around it. To make a fair comparison among models with different dimensionalities, it would be better to apply some dimensionality-insensitive distance metrics.

      We thank the reviewer for this suggestion. We repeated all vector-based similarity calculations using the Dimension Insensitive Euclidean Metric (DIEM). As shown in Figure S9, the results are broadly similar, though with overall somewhat lower brain correlations for most transformers compared to cosine similarity.

      (2) Depending on the scope of the current study, if the authors would like to establish whether transformers are inferior to graph-based models in representing syntax, a linear classifier using the model embeddings would be sufficient. I think this would be a more direct assessment of model syntax ability than correlation with brain data.

      As we discuss in our previous responses, our objective in this study was not to assess how well transformers can represent syntax. Rather, the goal was to assess whether internal transformer representations have similar geometric properties to patterns of brain activation. Our results indicate that transformers do represent sentence structure, but in a different manner to the human brain.

      Reviewer #3 (Public review):

      (1) The interpretation of findings is nuanced. Although Transformers underperform as brain models on the critical subsets of controlled sentences, a Transformer outperforms all other models when evaluated on the union of all sentences when both word-level content and structure vary. Transformers also yield equivalent or better models of human behavioral data. Thus, although Transformers have demonstrable flaws as human models, which are pinpointed here, in the general case, (some) Transformers are more human-like than the other models considered.

      The reviewer argues that we overstate some of our conclusions, as several transformers achieve higher brain correlations than the hybrid model when computed over all sentence pairs, as well as on the behavioural data. In response, we first note that our primary interest in this paper is in the block diagonal sentence pairs, as these were specifically designed to interrogate how different models represent sentence structure. The analysis over all sentence pairs is presented for comparison but is not our primary focus in this paper, as also reflected in the pre-registered prediction that our VerbNet-CN hybrid model would show higher brain correlations than transformers over this block diagonal subset.

      Second, we have included a new analysis in the revised manuscript (Figure S9) where we compute brain correlations controlling for the pattern of similarities observed in the primary visual cortex (averaged over participants), as a way to control for visual similarity. This added control substantially reduces the brain correlations of the transformers, such that they all have lower correlations than VerbNet-CN and AMR-smatch even over the set of all sentence pairs. We provide interpretation of this result in the discussion.

      Third, we would like to note that one of the disadvantages of transformers as a model of mind or brain representations is that they are largely a ‘black box’ whose workings are poorly understood. One advantage of hybrid models like our simple semantic role model is that they can be much easier to interpret, thereby enabling them to be used to determine which features are most important for brain representations of sentence meaning, and what mechanisms are used to combine individual words into a full sentence. Given their relative simplicity and interpretability, we believe hybrid models have considerable value as scientific tools, even in cases where they achieve comparable correlations to transformers. We have added a short discussion of this issue in the revised manuscript (page 10).
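      The restriction to block diagonal sentence pairs mentioned in the first point of this response can be sketched as a boolean mask over the sentence-by-sentence similarity matrix. The sketch assumes that the 108 sentences are ordered so that the six subsets are contiguous and of equal size (18 each); that layout and the function name are assumptions made for illustration, not details confirmed by the manuscript.

```python
import numpy as np

def block_diagonal_pair_mask(n_items, n_blocks):
    """Boolean mask selecting unique pairs (upper triangle) that fall within the same block.
    Assumes equally sized, contiguous blocks (a hypothetical layout)."""
    if n_items % n_blocks != 0:
        raise ValueError("n_items must be divisible by n_blocks")
    block_size = n_items // n_blocks
    block_id = np.arange(n_items) // block_size
    same_block = block_id[:, None] == block_id[None, :]
    upper = np.triu(np.ones((n_items, n_items), dtype=bool), k=1)
    return same_block & upper

mask = block_diagonal_pair_mask(108, 6)
n_all_pairs = 108 * 107 // 2      # every unique sentence pair
n_diag_pairs = int(mask.sum())    # pairs within the same subset
```

      Under these assumptions the full matrix contains 5778 unique pairs, of which 918 lie on the block diagonal, so a block diagonal analysis would index both the model and brain similarity matrices with `mask` before correlating them.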

      (2) There may be confounds between the critical sentence structure manipulations and visual representations of sentence stimuli. This is inconvenient because activation in brain regions that process semantics tends to partially correlate with visual cortex representations, and computational models tend to reflect the number of words/tokens/elements in sentences. Although the study commendably controls for confounds associated with sentence length, there could still be residual effects that remain. For instance, the Graph model correlates most strongly with the visual cortex despite these sentence length controls.

      We agree with the reviewer that this is a potential confound. As noted in the previous response, we have implemented a new control analysis in which we directly control for visual similarities as reflected in participant-averaged similarities of primary visual cortex activations in response to all stimuli. These results are shown in Figures S8-S11 in the SI. We show that transformer correlations are reduced much more than graph and hybrid models with this control. Also, we note that the AMR-smatch graph model shows high correlations with other brain regions even after removing correlations with the visual cortex (Figure S10). This indicates that the model represents a range of sentence features, including both superficial visual or length-related features, as well as semantic features that are represented in common with language and other cortical regions.
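      A control analysis of this kind can be expressed as a partial correlation: the visual-cortex similarities are regressed out of both the model and the brain similarities, and the residuals are then correlated. The sketch below is a minimal illustration with synthetic data. Linear residualisation, Pearson correlation, and all variable names are assumptions, and the manuscript's procedure may differ.

```python
import numpy as np

def residualize(values, control):
    """Remove the linear effect of `control` (plus an intercept) from `values`."""
    design = np.column_stack([np.ones_like(control), control])
    beta, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ beta

def partial_correlation(model, brain, control):
    """Pearson correlation between model and brain after regressing out `control` from both."""
    return float(np.corrcoef(residualize(model, control),
                             residualize(brain, control))[0, 1])

rng = np.random.default_rng(2)
n_pairs = 500
# Synthetic pairwise similarities: both the model and the target region are
# driven almost entirely by a shared visual component.
visual = rng.normal(size=n_pairs)
model = visual + 0.1 * rng.normal(size=n_pairs)
brain = visual + 0.1 * rng.normal(size=n_pairs)

raw_r = float(np.corrcoef(model, brain)[0, 1])
partial_r = partial_correlation(model, brain, visual)
```

      In this synthetic case the raw correlation is high while the partial correlation is close to zero, which is the pattern one would expect for a model whose brain correlation is carried mainly by visual similarity.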

      (3) Sentence similarity computations are emphasized as the basis for unifying comparative analyses of graph structures and vector data. A strength of this approach is that correlation is not always the ideal similarity metric. However, a weakness is that similarity computations are not unified across models. This has practical consequences here because different similarity metrics applied to the same model produce positive or negative correlations with brain data.

      The reviewer notes that the method for computing similarities differs between the vector-based (mean and transformer) models, and the hybrid and syntax-based models, thereby potentially adding an additional confound to our results. We agree that this is a potential limitation, and our correlations should always be understood as applying to a model paired with a similarity metric. However, we believe that this is mostly unavoidable when comparing different formalisms. In the revised manuscript we have incorporated an entirely new similarity metric for vector-based models (DIEM similarity), as well as an extended discussion of the effect of different similarity metrics for graph and hybrid models.

      Reviewer #3 (Recommendations for the authors):

      (1) Compute separate RSAs on each sentence pair type (especially Swapped), to quantify how each sentence type manipulation contributed to the divergence between model and brain. Although the manuscript is already brimming with analyses, I think squeezing this in would be helpful because the results currently rely on qualitative inspection of group-average scatter plots to interpret how sentence pair manipulations contributed to the divergence between Transformers and humans. The Swapped condition would appear to be the centrepiece of the title and manuscript, and potentially the only condition for which confounds associated with the surface form of sentence are controlled for (because sentences should be the same words in different orders). Thus, this analysis might see to the inconvenient visual cortex correlations in Figures 3d/e.

      We respectfully disagree that computing separate RSA for each sentence pair type would be a useful additional analysis. The motivation for the construction of our stimulus set was to provide a range of variants of a given base sentence that alter the semantic meaning and lexical content (somewhat) independently. The purpose of the ‘modified’ sentences, for instance, is to construct sentences with a similar overall meaning but lower lexical similarity due to the inclusion of many modifier words. It is precisely the comparisons across the different pair types that provide information about how each model represents sentence semantics, so restricting an analysis to only a single subset would not be very informative. Another problem with this approach is that it would dramatically reduce the number of sentence pairs analysed, thereby decreasing statistical power. In the revised manuscript we have provided additional details regarding the motivation and rationale for how our stimulus set of 108 sentences was constructed, which should help to elucidate this point more clearly. The following excerpt is from page 3:

      “Within each of the six subsets, we begin with a base sentence such as ‘the cameraman brought the equipment to the director’, which we then systematically modified in various ways to create different combinations of lexical and compositional similarity, in order to dissociate these two aspects of meaning (see Table 1 for further details).”

      (2) Explaining the motivation for the sentence stimulus types. I appreciated the careful design of the dataset, but I couldn't immediately work out the motivation for all the different sentence types, and why this selection was ideal to identify divergences with Transformers. For instance, given the goal of (approximately) controlling for lexical similarity whilst varying sentence meaning, I couldn't immediately see why stimulus blocks weren't all built from rearranging the same content words (as in the Swapped condition). The negative RSA correlation with the Mean model also made me stop and think - it seems like the more similar the words in a sentence, the more different their structure, and vice versa, but I wasn't clear that this was a design feature. Thus, a few extra words motivating the conditions could be helpful for the reader, and these might helpfully lead them to anticipate the negative RSA correlation.

      As noted in the previous response, in the revised manuscript we have expanded our explanation of the rationale for the construction of our 108 sentences. In particular, Table 1 in the methods section now includes two additional columns which summarise the intended combinations of lexical and overall sentence similarity which our sentence pairs are intended to satisfy.

      (3) Explanation for why different implementations and similarity computations between variants of ostensibly equivalent Graph / Hybrid models yielded widely divergent positive vs negative brain correlations, despite both positively capturing behavioural ratings. This might incorporate a brief intuitive explanation of how Graph model similarities were computed (e.g., what SMATCH and WWLK do). In light of the above, why do different similarity algorithms applied to the Graph model yield positive and negative correlations on the same brain (e.g., Figure S2 - Graph / Graph-WL a,b, diag-pairs). Same goes for why Hybrid and Hybrid-AMR yielded positive vs negative correlations (e.g., Figure S2 - Graph / Graph-WL a,b, diag-pairs). Acknowledge that the brain results are sensitive to similarity computations in the Discussion.

      We appreciate this suggestion. We have added an extended consideration of these issues to the discussion (pages 10-11), as well as some additional details regarding the differences between the Smatch and WWLK metrics in the methods section (page 17).

      (4) Acknowledgement and explanation of why the human similarity ratings were poor at explaining brain data in Figure 2a,b (right column diag-pairs). The poor behaviour vs brain match is indirectly implied in the Discussion as "the comparison between behavioural and fMRI data is somewhat difficult owing to the difference in task structure." However, I would suggest being upfront and explicitly mentioning and explaining the poor brain match in Figures 2a and b, because the reader will notice and wonder - especially because the models correlate strongly with the behavioural data without the models doing the human behavioral task (though this could be a possibility, see later).

      As suggested, we have included a passing reference to this in the presentation of our main results in page 5, and a lengthier discussion on page 11:

      “Our study has several limitations. First, we found a surprisingly low correlation between behavioural ratings and brain activations (see Figure 2). This may be partly explained by differences in task structure. In the behavioural experiment, participants viewed many pairs of related sentences, and were explicitly asked to pay attention to differences in the words of each sentence. In contrast, in the fMRI task participants (who were not the same as the behavioural task participants) read one sentence at a time without an explicit comparison. In addition, we suspect that presentation of so many sentence pairs with highly similar structures may have biased the way in which participants rated sentence similarity. Modifications to the behavioural task to mitigate these aspects may reduce the divergence between behavioural and brain findings.”

      (5) Brief explanation of why model vs brain correlations tended to be strongest in the visual cortex (Figure 3d,e). Currently, this issue is only mentioned in passing, however, it seems worthy of further comment.

      We thank the reviewer for highlighting this issue. We have added discussion of the potential for visual confounds at several points in the revised manuscript, including the ‘Neuroscience of semantics’ subsection on page 11. As noted, we have also added a new analysis in which we compute correlations controlling for the average RSA similarities of the primary visual cortex. We find that this additional control significantly reduces correlations for most transformer models, but reduces the correlations for most of the graph and hybrid models only modestly, particularly for VerbNet-CN (see Figures S8-S11).
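      The control analysis described above can be illustrated with a minimal sketch. It assumes that "controlling for" means a partial rank correlation computed on vectorised pairwise similarities; the response does not specify the exact procedure, so the function name `partial_spearman` and the simulated data are hypothetical and are not the authors' implementation.

```python
import numpy as np
from scipy.stats import rankdata


def partial_spearman(model, brain, control):
    """Rank correlation of model and brain after regressing the control
    similarities out of both (all three vectors are rank-transformed)."""
    x, y, z = (rankdata(v) for v in (model, brain, control))
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


# Simulated example (assumption): model and brain similarities are related
# only through a shared visual component.
rng = np.random.default_rng(0)
n_pairs = 500
visual = rng.normal(size=n_pairs)
model_sim = visual + 0.5 * rng.normal(size=n_pairs)
brain_sim = visual + 0.5 * rng.normal(size=n_pairs)

raw = float(np.corrcoef(rankdata(model_sim), rankdata(brain_sim))[0, 1])
partial = partial_spearman(model_sim, brain_sim, visual)
```

      In this toy case the raw correlation is high while the partial correlation is close to zero, which is the pattern one would expect for a model whose brain fit is driven mainly by visual similarity.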

      (6) Softening/clarifying some statements that could be misconstrued as suggesting Transformers were universally inferior models. Statements made in the Abstract/Discussion initially came over to me as implying that Transformers were universally inferior models when compared to the Graph/Hybrid models - but this appears only to be true when one looks at analyses conducted within block diagonal sentence subsets. Otherwise, when analyses are conducted on all sentences (between and within blocks, Figure 5) Llama 3 L2 provides by far the strongest brain model. Transformers also appear to yield the strongest accounts of the behavioural data, whether tested on block diagonal or all sentence pairs (Figure S3). To remedy this, I would suggest softening some statements in the Abstract/Discussion that could be misconstrued as suggesting that Transformers were universally inferior. I would also suggest explicitly acknowledging that when the entire dataset was analyzed, Transformers were most accurate, and that (some) Transformers best accounted for the behavioural data.

      We agree that certain sections of the previous draft lacked precision about the conclusions to be drawn regarding the representational capacities of transformers. We have revised the abstract and conclusion to better reflect our intended message, which is that transformers certainly can represent sentence structure and semantic roles, but that the way in which they do this (through vector representations in their hidden layers) is significantly different to how such features are represented in the human brain. In particular, we have included this new text on page 10:

      “We emphasise that our results do not show that transformers fail to represent syntactic or semantic role information. Indeed, large language models show clear capabilities of correctly interpreting sentence structure, and probing studies have found that transformers represent information about syntax and word order. This is consistent with our finding that directly prompting GPT-4 to rate sentence similarity yields very high correlations with human judgements (see Figure S3). Nonetheless, the fact that transformers can encode and utilise structural information to perform linguistic tasks does not mean that they effectively utilise this information to construct a brain-like representation of sentence meaning.”

      (7) Given that GPT-4 was already deployed to parse semantic roles for the hybrid model, and GPT-4 should be able to generate reasonable similarity ratings between sentence pairs, it struck me that an interesting addendum could be to use GPT-4 similarities derived from the human behavioral task to interpret both brain and human behavioral data. This might also help support the case for conducting analyses within a similarity-based framework.

      We appreciate this suggestion. We have added this model (GPT-4 ratings of sentence similarity) to the revised manuscript (see Figures S1-S3).

      Other changes

      As noted by reviewer 3, the full set of sentence pairs was missing from the previous draft. They have been added to the SI of the revised manuscript.

      We have renamed the Graph and Hybrid models in the manuscript to AMR-Smatch and VerbNet-CN respectively, to make clear which models these terms refer to, and also to better differentiate them from the newly added constituency parse graph models.

      We have thoroughly revised the discussion section, incorporating feedback from all reviewers regarding areas needing additional depth.

      We have added subsections to the discussion to aid the reader navigating the now lengthier section.

    1. eLife Assessment

      This valuable study uses technically compelling long-term in vivo recordings and computational modeling to investigate whether hawkmoth olfactory receptor neurons show circadian modulation of spontaneous firing. The authors further propose the provocative model that post-translational mechanisms, rather than the transcriptional-translational processes, may contribute to circadian regulation of neuronal excitability. However, the evidence for circadian firing in these neurons, and for post-translational modification of Orco as the underlying mechanism, remains incomplete. In contrast, the study does provide strong evidence that the application of cyclic nucleotides can modulate Orco-dependent activity at a single time point, and reports that the temporal pattern of Orco transcript abundance is not circadian. However, the findings are incomplete to exclude a role for transcriptional-translational mechanisms and their associated multi-layered controls in circadian regulation.

    2. Joint Public Review:

      This manuscript puts forward the provocative idea that a posttranslational feedback loop regulates daily and ultradian rhythms in neuronal excitability. The authors used in vivo long-term tip recordings of the long trichoid sensilla of male hawkmoths to analyze spontaneous spiking activity indicative of the ORNs' endogenous membrane potential oscillations. This firing pattern was disrupted by pharmacological blockade of the Orco receptor. They then use these recordings together with computational modeling to predict that Orco receptor neuron (ORN) activity is required for circadian, not ultradian, firing patterns. Orco did not show a circadian expression pattern in a qPCR experiment, and its conductance was proposed to be regulated by cyclic nucleotide levels. This evidence led the authors to conclude that a post-translational feedback loop (PTFL) clockwork, associated with the ORN plasma membrane, allows for temporal control of pheromone detection via the generation of multi-scale endogenous membrane potential oscillations. The findings will interest researchers in neurophysiology, circadian rhythms, and sensory biology. However, the manuscript has limited experimental evidence to support its central hypothesis and is undermined by several assumptions that underlie their data analysis and model builds, as well as insufficient biological data including critical controls to validate and/or fully justify the model the authors are proposing.

      Strengths:

      The authors raise several intriguing model-based hypotheses regarding the mechanisms that underlie the generation of olfactory rhythms. The electrophysiological approach and the long-term recording paradigm are elegant and technically impressive. In the revised version, the authors have added additional qPCR data supporting the lack of rhythmic Orco transcript expression and included a new figure suggesting that cAMP can modulate Orco conductance.

      Major weaknesses:

      (1) The cAMP experiment was only conducted at one time-point, which is insufficient to support the central claim that "cAMP and cGMP may have ZT-dependent effects on Orco conductivity".

      (2) The revised manuscript continues to rely heavily on prior publications or defers key mechanistic questions (or important manipulations) to future studies. In its current form, the evidence presented remains insufficient to support the central claim that a PTFL constitutes the primary underlying circadian clock mechanism. The proposed model is intriguing, but the data provided do not yet directly demonstrate the novel mechanism.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review

      This manuscript puts forward the provocative idea that a posttranslational feedback loop regulates daily and ultradian rhythms in neuronal excitability. The authors used in vivo long-term tip recordings of the long trichoid sensilla of male hawkmoths to analyze spontaneous spiking activity indicative of the ORNs' endogenous membrane potential oscillations. This firing pattern was disrupted by pharmacological blockade of the Orco receptor. They then use these recordings together with computational modeling to predict that Orco receptor neuron (ORN) activity is required for circadian, not ultradian, firing patterns. Orco did not show a circadian expression pattern in a qPCR experiment, and its conductance was proposed to be regulated by cyclic nucleotide levels. This evidence led the authors to conclude that a post-translational feedback loop (PTFL) clockwork, associated with the ORN plasma membrane, allows for temporal control of pheromone detection via the generation of multi-scale endogenous membrane potential oscillations. The findings will interest researchers in neurophysiology, circadian rhythms, and sensory biology. However, the manuscript has limited experimental evidence to support its central hypothesis and is undermined by several questionable assumptions that underlie their data analysis and model builds, as well as insufficient biological data, including critical controls to validate and/or fully justify the model the authors are proposing.

      We thank the reviewers for their thorough and thoughtful comments and believe that the manuscript is much stronger now after the revision which incorporates the requested changes. We added results of new experiments and additional analyses. Although these new insights did not change the previous conclusions, we significantly reworked the Discussion and added further references to clarify the conclusions we want to make.

      Please note that we used ORN as acronym for “olfactory receptor neuron” throughout the manuscript. ORNs contain odorant receptors (ORs), and in insects these ORs associate with the olfactory receptor co-receptor (Orco) to be trafficked to the membrane of the cilium of the ORN, where they can be contacted by pheromones and odorants. In Manduca sexta, evidence is accumulating for G-protein coupled metabotropic pheromone transduction and not for OR-Orco dependent ionotropic transduction, as shown for Drosophila melanogaster. In both insect species, besides its chaperone function, Orco can form leaky cation channels, which can regulate the spontaneous spiking activity of ORNs. In this study, we explored this role of Orco.

      Strengths:

      The study is notable for its combination of long-term in vivo tip recordings with computational modeling, which is technically challenging and adds weight to the authors' claims. The link between Orco, cyclic nucleotides, and circadian regulation is potentially important for sensory neuroscience, and the modeling framework itself - a stochastic Hodgkin-Huxley formulation that explicitly incorporates channel noise - is a solid and forward-looking contribution. Together, these elements make the study conceptually bold and of clear interest to circadian and olfactory biologists.

      Major weaknesses:

      At the same time, several limitations temper the conclusions. The pharmacological evidence relies on a single antagonist and concentration, without key controls. The circadian analysis is based on relatively small numbers of neurons, with rhythms detected only in subsets, and the alignment procedure used in constant darkness raises concerns of bias. The molecular evidence is sparse, with only three qPCR timepoints, and the model, while creative, rests on assumptions that are not yet fully supported by in vivo data.

      Please see our responses to the detailed comments.

      Detailed comments are provided below:

      (1) The role for Orco proposed in the authors' model largely stems from the effects seen following the administration of (a single dose) of the Orco antagonist, OLC15. However, this hypothesis is undercut by the lack of adequate pharmacological controls, including a basic multipoint OLC15 dose-response series in addition to the administration of blockers for the other channels that are embedded in their model, but which were ruled out as being involved in the modulation of biological rhythms. In addition, these studies would (ideally) also benefit from the inclusion of the same concentration (series) of an inactive OLC15 analog to better control for off-target effects.

      The Orco agonist VUAA1 (Jones et al., 2011) binds directly to Orco and increases the channel open time probability. In M. sexta hawkmoths, we have already published that VUAA 1 increases the low spontaneous activity of ORNs in a dose-dependent fashion (Nolte et al., 2013). Chen and Luetje (2012) systematically varied the chemical structure of VUAA1 to identify new Orco ligands and discovered 22 Orco ligand candidates (OLCs) that either activated or inhibited Orco. In their heterologous expression system, Orco was most sensitive to inhibition by OLC15. Based on these results, we published a dose-response curve of OLC15 inhibition (1-100 µM) using in vivo tip recordings of pheromone-sensitive long trichoid sensilla of M. sexta (Nolte et al., 2016). There, we also demonstrated that OLC15 dose-dependently antagonizes the VUAA1-dependent activation of Orco.

      Furthermore, we tested other published Orco antagonists, which were characterized in heterologous assays, in primary cell cultures of hawkmoth ORNs, as well as in in vivo assays in intact hawkmoths. We focused on amiloride-derived antagonists, because we previously identified an amiloride-sensitive cation channel in hawkmoth ORNs. We found that, in contrast to OLC15, the amilorides HMA and MIA were not Orco-specific antagonists but instead affected different ion channel targets depending on the time of day (Nolte et al., 2016). Based on those experiments and the dose-response curves, we determined that the Orco agonist VUAA1 (Jones et al., 2011) and the Orco antagonist OLC15 (Chen and Luetje, 2012) worked best in hawkmoth ORNs to target Orco pharmacologically. On the basis of those results and of comparative tests with other published Orco antagonists, we have since used a dose of 50 µM OLC15 in all further experiments as the most adequate to antagonize Orco functions in Manduca. In the current study, we focus on Orco without excluding the possibility that other ion channels in the ORNs contribute to the control of membrane potential rhythms.

      We have clarified the Methods section accordingly.

      (2) The expression pattern of Orco was assessed using qPCR at only three timepoints. Rhythmic transcripts can easily be missed with such sparse sampling (Hughes et al., 2017). A minimum of six evenly spaced timepoints across a 24-hour cycle would be required to confidently rule out circadian transcriptional regulation. In addition, the use of the timeless mRNA control from another study is not acceptable. Furthermore, qPCR analysis measures transcript abundance, not transcription, as the authors repeatedly state. Transcriptional studies would require nuclear run-off or, more recently, can be done with snRNAseq analysis. Taken together, these concerns undermine the authors' desire to rule out TTFL-based control that directly led them to implicate a PTFL-based model.

      We agree with the referees that more time points and a direct comparison between timeless and Orco mRNA levels should be included in this manuscript. We included these additional qPCR experiments and edited the manuscript to make clear that we measure transcript abundance, but we will not perform snRNAseq analysis due to time and financial constraints.
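      The sampling concern raised in point (2) can be made concrete with a small sketch. A fixed-period cosinor model (mean level plus cosine and sine terms) has three parameters, so three timepoints can always be fitted exactly and leave no residual against which to judge rhythmicity. The sampling times below (ZT 0, 8, 16 and six evenly spaced points) and the helper `cosinor_residual` are assumptions for illustration, not the study's actual design or analysis.

```python
import numpy as np


def cosinor_residual(times_h, values, period_h=24.0):
    """Residual sum of squares of a least-squares fit of
    mesor + a*cos + b*sin with a fixed period."""
    w = 2 * np.pi * np.asarray(times_h, dtype=float) / period_h
    design = np.column_stack([np.ones_like(w), np.cos(w), np.sin(w)])
    values = np.asarray(values, dtype=float)
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return float(np.sum((values - design @ coef) ** 2))


# Pure noise, i.e. no rhythm at all (simulated values, not real qPCR data).
rng = np.random.default_rng(1)
rss_three = cosinor_residual([0, 8, 16], rng.normal(size=3))
rss_six = cosinor_residual([0, 4, 8, 12, 16, 20], rng.normal(size=6))
```

      With three timepoints the residual is zero up to rounding even for pure noise, whereas six timepoints leave residual degrees of freedom. Replicates at each timepoint add an estimate of measurement noise but, under these assumptions, still do not allow a lack-of-fit check of the sinusoidal shape.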

      (3) The modelling presented is based on Orco as a ZT-dependent conductance tied to the cAMP oscillations that were reported by this group in the cockroach and from the presence and functionality in Manduca of homomeric Orco complexes that are devoid of tuning ORs. While these complexes have been generated in cell culture and other heterologous expression systems, as well as presumably exist in vivo in the Drosophila empty neuron and other tuning OR mutants, there is no evidence that these complexes exist in wild-type Manduca ORNs. While this doesn't necessarily undermine every aspect of their models, the authors should note the presence of Orco/OR complexes rather than Orco homomeric complexes.

      Our ELISAs found circadian oscillations in cAMP levels not only in antennae of the Madeira cockroach (Schendzielorz et al., 2014, 2012), but also in hawkmoth antennae (Schendzielorz et al., 2015). For clarification, we added the 2015 citation to the Modeling chapter in the Methods section.

      We agree with the referees that we cannot distinguish between Orco homo- and heteromers in the different compartments of our hawkmoth ORNs but we know that both are expressed in the pheromone-sensitive ORNs. Thus, as the referee suggests, we added text regarding the presence and localization of OR-Orco heteromers. Consistent data collected across different experiments (heterologous expression systems, primary cell cultures of hawkmoth ORNs, in vivo/in situ studies) support that Orco homomers are present in hawkmoth ORNs. In addition to co-expression of MsexOrco and MsexSNMP-1 with either MsexOr-1 or MsexOr-4 in a heterologous expression system, MsexOrco expression alone was already sufficient to increase intracellular Ca<sup>2+</sup> levels spontaneously as a result of its property as leaky, non-specific cation channel, and in response to VUAA1 application (Nolte et al., 2013). Both in developing hawkmoth pupae and differentiating primary cell cultures of hawkmoth ORNs, Orco expression started during a developmental time window where ORNs did not yet express pheromone receptors but where Orco affected spontaneous activity and intracellular Ca<sup>2+</sup> levels dependent on VUAA1 (Nolte et al., 2016). In vitro patch clamp studies of differentiating cultured hawkmoth ORNs during this time window of pupal development characterized ion channels/currents with properties of Orco as a leaky, non-specific cation channel/current that depends on protein kinase C and cyclic nucleotides (Dolzer et al., 2021, 2008; Krannich and Stengl, 2008; Stengl, 1993). Thus, Orco homomers are present in developing hawkmoth ORNs during a time window where ORNs already express spontaneous activity but they do not heteromerize with pheromone receptors. 
      However, we do not know whether and in what ratio homo- and heteromers of Orco and ORs are present in the respective sensillum compartments of adult hawkmoths, because none of the OR-specific antibodies tested worked in immunocytochemical studies of hawkmoth antennae (Nolte et al., 2013; Stengl, 1994; Stengl and Hildebrand, 1990). Our hypothesis of a differential distribution of Orco homomers in the soma and dendrite compartments, and of OR-Orco heteromers in the cilia, is based on the differential immunocytochemical localization of Drosophila ORs mainly in the cilia compartment (Benton et al., 2006).

      We clarified our manuscript accordingly.

      (4) Some aspects of the authors' models, most notably the decision to phase align/optimize their DD and OLC15 recordings, are likely to bias their interpretations.

      There is consensus that insects display daily and circadian rhythms in pheromone-dependent mating, odor-gated feeding, and egg-laying behavior that phase-lock to environmental rhythms, corresponding with daily/circadian rhythms of sensory neuron physiology (e.g., Merlin et al., 2007; Rymer et al., 2007; Schendzielorz et al., 2015, 2012). However, circadian rhythms can be easily masked by stress, such as the disturbances during an experimentally very challenging long-term recording over several days. In addition, we have observed over the years in our animal-rearing facility that, in 17:7 light-dark cycles, the originally nocturnal hawkmoths M. sexta distribute their activity patterns over the course of the day, so that we find nocturnal as well as diurnal hawkmoths. Thus, light-dark cycles were not enough to ensure phase-synchronized behavioral rhythms, and it is very likely that the nocturnal hawkmoths, next to stress signals, rely heavily on pheromone/odor-dependent synchronization, as also found in other moth species (Ghosh et al., 2024). Because we focus on spontaneous activity and not on pheromone-dependent physiology in this study, we used isolated males that were never exposed to the female pheromones, taking phase dispersal into account. Therefore, it became necessary in free-running conditions to first determine the respective behavioral rhythm for each animal, and then to phase-align their activity patterns to allow for statistical analysis. Otherwise, circadian differences would average out in a phase-dispersed free-running population. As requested by the referees in point (7), we added RAIN to test for rhythmicity in each of our recordings and revised the manuscript accordingly.

      Furthermore, in preliminary experiments we briefly exposed hawkmoths to pheromone the night before the start of the experiment. However, we failed to obtain phase-synchronized spiking rhythms. Most likely, a circadian pattern of pheromone exposure would have been necessary as a zeitgeber, which could not be used here due to long-term pheromone-dependent effects on spiking activity. These results have been added as a supplementary figure to Figure 3.
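      The averaging argument in the response to point (4) can be sketched as follows. The example simulates idealised, noise-free 24-h rhythms with evenly dispersed peak times, estimates each phase from the Fourier component at the circadian period, and circularly shifts each trace so that its peak falls at time zero. The functions `circadian_phase` and `phase_align`, the hourly binning and the simulated traces are assumptions for illustration and are not the alignment procedure used in the manuscript.

```python
import numpy as np


def circadian_phase(trace, period):
    """Peak time (in bins) of the Fourier component at the given period."""
    t = np.arange(trace.size)
    z = np.sum((trace - trace.mean()) * np.exp(-2j * np.pi * t / period))
    return (-np.angle(z) * period / (2 * np.pi)) % period


def phase_align(traces, period):
    """Circularly shift every trace so its estimated peak lands on bin 0."""
    aligned = []
    for trace in traces:
        shift = int(round(circadian_phase(trace, period))) % period
        aligned.append(np.roll(trace, -shift))
    return np.array(aligned)


period = 24                       # bins per cycle (hourly bins, assumed)
t = np.arange(3 * period)         # three full cycles
peak_times = np.arange(8) * 3     # eight animals, peaks spread over the day
traces = np.array([np.cos(2 * np.pi * (t - p) / period) for p in peak_times])

dispersed_mean = traces.mean(axis=0)
aligned_mean = phase_align(traces, period).mean(axis=0)
dispersed_range = float(dispersed_mean.max() - dispersed_mean.min())
aligned_range = float(aligned_mean.max() - aligned_mean.min())
```

      In this idealised case the population average of the dispersed traces is flat, while the aligned average recovers the full amplitude. The sketch does not address the reviewers' concern that aligning noisy traces to their own peaks can itself produce an apparent rhythm, which is why a per-recording rhythmicity test is needed alongside alignment.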

      (5) The tip recordings from long trichoid sensilla are critical aspects of this study. These recordings were carried out on upper sensillar tips located on the distal-most second annulus. Since there are approximately 80 annuli on the Manduca antennae, it is unclear whether the recordings are representative of the antennal response.

      We think the reviewers might have misinterpreted our description of the recording site. In the Methods, we state that we clip off the 20 most distal annuli (leaving a stump of about 60 annuli) and insert the reference electrode into the flagellum up to the second annulus from the cut end, i.e., the recording sites are located at 2/3 – 3/4 of the antenna length as seen from the head of the animal. We clarified this in the Methods section.

      In addition, our lab showed with antibody stainings against Orco that apparently all ORNs that innervate long and short trichoid sensilla along the whole flagellum express the same staining pattern (Nolte et al., 2016). Lee and Strausfeld (1990) mapped all types of antennal sensilla, and together with the pheromone-dependent tip recordings of Kaissling et al. (1989) it was shown that most of the male antennal sensilla are pheromone-sensitive long trichoid sensilla, with one of the two innervating ORNs always responding to bombykal, ensuring high sensitivity of pheromone detection. Furthermore, our patch clamp recordings of primary cell cultures of whole male antennae found largely overlapping ion channel populations across ORNs (review: (Stengl, 2010)). This would indicate that all ORNs, whether they express ORs sensitive to pheromone or general odorants, could potentially share the same Orco-dependent spontaneous activity rhythms. Furthermore, in our lab, different experimenters who recorded in different years from long trichoid sensilla on different annuli did not detect obvious differences in either the spontaneous activity or the pheromone responses (c.f., Dolzer et al., 2003; Gawalek and Stengl, 2018; Schneider et al., 2025). Thus, it is very likely that we are reporting a general encoding mechanism that is not locally restricted along the antennal flagellum and is very likely shared by all types of OR-Orco expressing ORNs.

      (6.1) The authors do not provide any data in support of their cAMP/cGMP-based Orco gating…

      There are publications supporting cyclic nucleotide gating of Orco in Drosophila, but only after previous phosphorylation via protein kinase C (PKC; review: (Wicher and Miazzi, 2021)). Since Orco is very conserved among insect species, it is likely that PKC- and cGMP/cAMP-dependent regulations are present for Orco in other insect species. To test this, we are currently characterizing second messenger-dependence of spontaneous spiking activity, which is the focus of a follow-up manuscript. Nevertheless, to provide more evidence for our hypothesis of the current manuscript, we added a new set of tip-recording experiments that demonstrate cAMP-dependent gating of Orco. Because of the addition of this figure, we merged figures 8-10 into Figure 8 and added the cAMP data as Figure 9.

      (6.2) … and the PTFL model proposed is somewhat disappointing.

      For a detailed introduction of our PTFL membrane clock hypothesis please see our opinion paper that we refer to in the manuscript (Stengl and Schneider, 2024). We added clarification of how Orco activation can influence cAMP levels. A more elaborate PTFL clock model including many more of the identified ion channels in hawkmoth ORNs is the focus of another manuscript to come.

      (6.3) The model seems to be influenced by their long-held proposal that insect olfactory signaling has a critical metabotropic component involving cyclic nucleotides, PKC, etc, a view that may be influenced by the use of Orco homomeric complexes generated in HEK cells.

      Indeed, we propose a metabotropic pheromone-transduction cascade, which in moths and cockroaches is based on G-protein-mediated activation of phospholipase C but not on adenylyl cyclase activation. Our hypothesis is not influenced by HEK cell heterologous expression studies of Orco but is supported by our own work comparing in vivo tip recordings of intact hawkmoths with patch clamp experiments on hawkmoth primary cell cultures of olfactory receptor neurons, which are able to respond to their species-specific pheromones in vitro (Schneider et al., 2025; Stengl, 2010; Stengl and Funk, 2013; Wicher and Miazzi, 2021). In addition, a multitude of publications by other laboratories with in vivo and in vitro studies using physiological, genetic, and immunocytochemical assays all support a metabotropic signal transduction cascade in insect olfaction (Stengl, 2010; Stengl and Funk, 2013; Takagi et al., 2025; Wicher and Miazzi, 2021). In contrast, the hypothesis suggesting a solely ionotropic pheromone- and general odor-dependent transduction cascade for all insect species rests on very sparse experimental evidence, primarily from heterologous expression systems such as HEK cells, which lack the insect’s WT molecular surroundings and thus cannot predict OR-Orco function in vivo. Furthermore, the ionotropic hypothesis is heavily based upon the argument that an inverse 7TM receptor cannot couple to G-proteins, which lacks careful backup via biochemical and structural studies. In addition, the ionotropic hypothesis lacks support from carefully performed physiological in vivo studies in different insect species that pay attention to the analysis of the distinct kinetic components of ORNs' odor/pheromone responses and that employ physiological concentrations and durations of odor/pheromone stimuli (please see our most recent publication by Schneider et al. (2025)). We added references to the possible odor transduction mechanisms to the introduction.

      (6.4) Nevertheless, structural studies on Orco do not support a cyclic nucleotide binding site, although PKC-based phosphorylation has been implicated in the fine-tuning/adaptation of olfactory signaling.

      While structural studies did not find evidence for conserved known cyclic nucleotide binding sites on Orco, this does not exclude the presence of indirect cAMP effects via, e.g., Orco subunits complexing with other molecules under direct cAMP control, such as other ion channel subunits. Furthermore, it does not exclude so far unknown binding sites, or sites that fold out only after a specific sequence of previous phosphorylations of the many phosphorylation sites on Orco. Indeed, physiological studies in Drosophila presented evidence for cyclic nucleotide dependence of Orco after previous PKC-dependent phosphorylation (Getahun et al., 2013). Our ongoing in vivo experiments in hawkmoths further corroborate a zeitgeber time-dependent PKC- and cyclic nucleotide-dependent modulation of Orco. These detailed studies will be published in a follow-up publication. In the revised version of this manuscript, we added tip-recording experiments that indicate cAMP involvement in Orco gating (new Figure 9).

      (7) Because only 5/11 LD and 7/10 DD animals showed daily rhythms, with averages lacking clear daily modulation, the methods are not sufficiently reliable enough to reveal novel underlying mechanisms of circadian rhythm generation. The reported results are therefore not yet reliable or quantifiable. To quantify their results, the authors should apply tests for circadian rhythmicity using methods such as RAIN, JTK CYCLE, MetaCycle, or Echo. The use of FFT and Wavelet is applauded, but these methods do not have tests of significance for rhythms and can be biased when analyzing data in which there could only be 1-3 circadian cycles. Because the conclusions appear to be based on 11-12 neurons that were recorded for 2-4 days, the reader is concerned that the methods are not yet perfected to provide strong evidence for circadian regulation of spontaneous firing of ORNs. The average data (e.g., Figure 3Bii and 3Cii) highlight the apparent lack of daily rhythms. In summary, the results would be more compelling if more than 50% of the recordings had significant circadian amplitudes and with similar periods and phases.

      The long-term tip-recordings of intact hawkmoths are very challenging and take a very long time to accomplish; thus, we are very happy that we succeeded in obtaining so many of them (N=40). We are thankful for the reviewers’ suggestion to use RAIN, since this analysis revealed circadian rhythms in 7 of 11 LD recordings, 8 of 12 DD recordings, and 2 of 12 OLC15 recordings. Please see also our response to (4) above, commenting on the phase dispersal of activity rhythms observed in our experiments, as well as in the behavior of hawkmoth males in the mating cage.

      (8) The statement that circadian patterns of ORN firing are lost with the Orco antagonist (OLC15) is not strongly supported. The manuscript should be revised to quantify how Orco changed circadian amplitude in the 12 recorded neurons. Measures of circadian amplitude can avoid confusing/vague statements like Line 394 “low and high frequency bands appeared to merge during the activity phase around ZT 0 in the animals that showed clear circadian rhythms (N = 5 of 11 in LD)”. The conclusion that Orco blocks circadian firing appears to be contradicted by Figure 6, which indicates that ~6 of these neurons had circadian periods detected by wavelet. The manuscript would be strengthened with details about the specificity and reproducibility of the Orco antagonist. The authors quantify the gradual decrease in firing with the slope of a linear fit to estimate how the “effectiveness [of OLC15] increased over time.” They conclude that the drug “obliterated circadian rhythms and attenuated the spontaneous activity in several, but not all experiments (N = 8 of 12).” The report would be greatly strengthened with corroborating data from additional Orco antagonists and additional doses of OLC15 (the authors use only 50 µM OLC15).

      According to the valuable suggestions of the referees, we used RAIN to detect circadian rhythms in the spiking attributes in each individual animal. Since only 2 of 12 animals displayed a circadian rhythm in OLC15, statistical comparison of circadian amplitudes is not possible. We revised the results section accordingly and added to the figure legend to make it clearer that the heat maps in Fig 5 are representative from one animal each and not averages across animals.

      As the reviewer states correctly in (7), wavelet results of circadian rhythmicity must be interpreted carefully because of the low number of circadian cycles in ~3-4 day recordings. Since the heatmaps in Figure 5 visually revealed the presence of ultradian rhythms, the main focus of the wavelet analysis in Figure 6 is on the detection and quantification of ultradian periods up to 20 h.

      We revised the Methods section to include references to previous experiments that characterized the effect of different doses of OLC15 and other Orco antagonists and agonists in M. sexta antennae (Nolte et al., 2016). Please see also our response to (1).

      (9) The manuscript includes several statements that are more speculation than conclusion. For example, there is no evidence for tuning or plasticity in this report. Statements like the following should be removed or addressed with experiments that show changes in odor response specificity or sensitivity: "ORN signalosomes are highly plastic endogenous PTFL clocks comprising receptors for circadian and ultradian Zeitgebers that allow to tune into internal physiological and external environmental rhythms as basis for active sensing." (Discussion Line 622). The paper concludes that (line 380) "mean frequency of spontaneous spiking and the frequency of bursting expressed daily modulation, and are both most likely controlled via a circadian clock that targets the leak channel Orco." This is too bold given the available results.

      We revised the manuscript accordingly and clarified which statements are supported via published evidence and which are predictions based upon our novel hypothesis published in our opinion paper (Stengl and Schneider, 2024).

      (10.1) Because Orco conductance is modulated by cyclic nucleotides, it remains highly plausible that circadian regulation occurs upstream at the level of signaling pathways (e.g., calcium, calcium-binding proteins, GPCRs, cyclases, phosphodiesterases).

      We agree with the referees that it is very likely that there are multiple layers of interconnected feedback cycles that control Orco localization and activity. Our novel hypothesis suggests interlocked TTFL and PTFL control of physiological circadian rhythms, not strictly hierarchical TTFL control, which would require a daily turnover of membrane proteins and transcriptional control via the established TTFL clock in insect ORNs. We are currently searching for TTFL control at all levels of odor/pheromone transduction using ZT-dependent transcriptomics in combination with qPCR and single-nucleus transcriptomics, involving also all the molecules suggested by the referees. These studies are ongoing, are very time- and money-consuming, and are beyond the scope of this manuscript. However, we added a set of experiments to this manuscript in which we demonstrate that the effect of increased cAMP on the spontaneous spiking activity is mediated by Orco (new Figure 9).

      (10.2) The possibility that circadian oscillations of cyclic nucleotides are generated by the canonical TTFL mechanism has not been excluded. In fact, extensive work in Drosophila has demonstrated that the TTFL-based molecular clock proteins are required for circadian rhythms in olfaction.

      Our experiments that test circadian TTFL control at different levels of the cAMP transduction cascade in hawkmoth antennae are underway and are part of another publication. In section 6.2 we already stated that our experiments do not exclude that Orco is under indirect control of the TTFL. We revised our discussion accordingly.

      The published experiments on TTFL-dependent control of Drosophila olfaction that we are aware of (Krishnan et al., 1999; Tanoue et al., 2004) do not exclude interlinked PTFL and TTFL clocks. Krishnan et al. (1999) demonstrated that the TTFL clock in antennal olfactory receptor neurons correlates with circadian rhythms in odor responses measured in electroantennogram (EAG) recordings, not in single sensillum recordings as in our experiments. EAG recordings comprise not only voltage responses of the olfactory sensory neurons but also voltage changes generated in non-neuronal antennal cells such as trichogen and tormogen cells, which build the transepithelial potential gradient via V-ATPases that generate the high K⁺ concentration in the sensillum lymph (Jain et al., 2024; Klein, 1992; Thurm and Küppers, 1980). In addition, EAG recordings most likely contain responses of afferent neurons originating from somata in the brain that maintain central control of the antennae. Thus, EAG recordings are difficult to interpret.

      (11) A defining feature of circadian oscillators is the feedback mechanism that generates a time delay (e.g., PERIOD/TIMELESS repressing their own transcription). While the authors describe how cyclic nucleotides can regulate Orco conductance, they do not provide a convincing explanation of how Orco activity could, in turn, feed back into the proposed PTFL to sustain oscillations. For these reasons, the authors should consider:

      (a) Providing a broader discussion of non-TTFL models of circadian rhythms (e.g., redox cycles, post-translational modifications).

      We revised the discussion accordingly.

      (b) Reassessing Orco expression using a higher-resolution temporal sampling (≥6 timepoints per 24 h).

      We added those experiments to the revised version of the manuscript (see our response to (2)).

      (c) Clarifying or revising the PTFL model to explicitly address how feedback would be achieved. Alternatively, the data may be more consistent with Orco conductance rhythms being regulated by post-translational mechanisms downstream of the canonical TTFL oscillator, as suggested by the Drosophila olfactory system literature.

      We added possible negative feedback elements to the Discussion to explain how our proposed PTFL could in principle work independently of the TTFL clock.

      Minor weaknesses:

      (1) The authors should compare the firing patterns of ORN neurons to the bursts, clusters, and packets of retinal efferent spikes reported in Liu JS and Passaglia CL (2011; JBR). By comparing measures in moths to measures in Limulus, the authors might be able to address the question: Is the daily firing pattern of ORN neurons likely a conserved feature of circadian control of sensory sensitivity?

      We have revised the discussion accordingly.

      (2) The methods need further details. For example, it is unclear if or how single neuron activity was discriminated and whether the results were compromised by the relatively large environmental fluctuations in temperature (21-27 °C), humidity (35-60%), or other cues known to modulate spontaneous firing.

      These large fluctuations stem from doing experiments in different seasons (higher temperature and humidity in the summer months, lower temperature and humidity in winter). Throughout each individual experiment, conditions were stable. We clarified the Methods section accordingly.

      Recommendations for the authors:

      The authors should post the code for their computational model to a repository like GitHub.

      The code for the computational model is now available at https://github.com/a-c-schneider/VijayanForlinoEtAl2025_Model.git

      References

      Benton R, Sachse S, Michnick SW, Vosshall LB. 2006. Atypical Membrane Topology and Heteromeric Function of Drosophila Odorant Receptors In Vivo. PLOS Biology 4:e20. DOI: https://doi.org/10.1371/journal.pbio.0040020

      Chen S, Luetje CW. 2012. Identification of New Agonists and Antagonists of the Insect Odorant Receptor Co-Receptor Subunit. PLOS ONE 7:e36784. DOI: https://doi.org/10.1371/journal.pone.0036784

      Dolzer J, Fischer K, Stengl M. 2003. Adaptation in pheromone-sensitive trichoid sensilla of the hawkmoth Manduca sexta. Journal of Experimental Biology 206:1575–1588. DOI: https://doi.org/10.1242/jeb.00302

      Dolzer J, Krannich S, Stengl M. 2008. Pharmacological Investigation of Protein Kinase C- and cGMP-Dependent Ion Channels in Cultured Olfactory Receptor Neurons of the Hawkmoth Manduca sexta. Chemical Senses 33:803–813. DOI: https://doi.org/10.1093/chemse/bjn043

      Dolzer J, Schröder K, Stengl M. 2021. Cyclic nucleotide-dependent ionic currents in olfactory receptor neurons of the hawkmoth Manduca sexta suggest pull–push sensitivity modulation. European Journal of Neuroscience 54:4804–4826. DOI: https://doi.org/10.1111/ejn.15346

      Gawalek P, Stengl M. 2018. The Diacylglycerol Analogs OAG and DOG Differentially Affect Primary Events of Pheromone Transduction in the Hawkmoth Manduca sexta in a Zeitgebertime-Dependent Manner Apparently Targeting TRP Channels. Frontiers in Cellular Neuroscience 12:218. DOI: https://doi.org/10.3389/fncel.2018.00218

      Getahun MN, Olsson SB, Lavista-Llanos S, Hansson BS, Wicher D. 2013. Insect Odorant Response Sensitivity Is Tuned by Metabotropically Autoregulated Olfactory Receptors. PLOS ONE 8:e58889. DOI: https://doi.org/10.1371/journal.pone.0058889

      Ghosh S, Suray C, Bozzolan F, Palazzo A, Monsempès C, Lecouvreur F, Chatterjee A. 2024. Pheromone-mediated command from the female to male clock induces and synchronizes circadian rhythms of the moth Spodoptera littoralis. Current Biology 34:1414-1425.e5. DOI: https://doi.org/10.1016/j.cub.2024.02.042, PMID: 38479388

      Jain K, Prelic S, Hansson BS, Wicher D. 2024. Expression of Drosophila melanogaster V-ATPases in Olfactory Sensillum Support Cells. Insects 15:1016. DOI: https://doi.org/10.3390/insects15121016

      Jones PL, Pask GM, Rinker DC, Zwiebel LJ. 2011. Functional agonism of insect odorant receptor ion channels. Proceedings of the National Academy of Sciences 108:8821–8825. DOI: https://doi.org/10.1073/pnas.1102425108

      Kaissling KE, Hildebrand JG, Tumlinson JH. 1989. Pheromone receptor cells in the male moth Manduca sexta. Archives of Insect Biochemistry and Physiology 10:273–279. DOI: https://doi.org/10.1002/arch.940100403

      Klein U. 1992. The insect V-ATPase, a plasma membrane proton pump energizing secondary active transport: immunological evidence for the occurrence of a V-ATPase in insect ion-transporting epithelia. Journal of Experimental Biology 172:345–354. DOI: https://doi.org/10.1242/jeb.172.1.345

      Krannich S, Stengl M. 2008. Cyclic Nucleotide-Activated Currents in Cultured Olfactory Receptor Neurons of the Hawkmoth Manduca sexta. Journal of Neurophysiology 100:2866–2877. DOI: https://doi.org/10.1152/jn.01400.2007

      Krishnan B, Dryer SE, Hardin PE. 1999. Circadian rhythms in olfactory responses of Drosophila melanogaster. Nature 400:375–378. DOI: https://doi.org/10.1038/22566

      Lee JK, Strausfeld NJ. 1990. Structure, distribution and number of surface sensilla and their receptor cells on the olfactory appendage of the male moth Manduca sexta. Journal of Neurocytology 19:519–538. DOI: https://doi.org/10.1007/BF01257241

      Merlin C, Lucas P, Rochat D, François M-C, Maïbèche-Coisne M, Jacquin-Joly E. 2007. An Antennal Circadian Clock and Circadian Rhythms in Peripheral Pheromone Reception in the Moth Spodoptera littoralis. Journal of Biological Rhythms 22:502–514. DOI: https://doi.org/10.1177/0748730407307737

      Nolte A, Funk NW, Mukunda L, Gawalek P, Werckenthin A, Hansson BS, Wicher D, Stengl M. 2013. In situ Tip-Recordings Found No Evidence for an Orco-Based Ionotropic Mechanism of Pheromone-Transduction in Manduca sexta. PLOS ONE 8:e62648. DOI: https://doi.org/10.1371/journal.pone.0062648

      Nolte A, Gawalek P, Koerte S, Wei H, Schumann R, Werckenthin A, Krieger J, Stengl M. 2016. No Evidence for Ionotropic Pheromone Transduction in the Hawkmoth Manduca sexta. PLOS ONE 11:e0166060. DOI: https://doi.org/10.1371/journal.pone.0166060

      Rymer J, Bauernfeind AL, Brown S, Page TL. 2007. Circadian rhythms in the mating behavior of the cockroach, Leucophaea maderae. Journal of Biological Rhythms 22:43–57. DOI: https://doi.org/10.1177/0748730406295462, PMID: 17229924

      Schendzielorz J, Schendzielorz T, Arendt A, Stengl M. 2014. Bimodal Oscillations of Cyclic Nucleotide Concentrations in the Circadian System of the Madeira Cockroach Rhyparobia maderae. Journal of Biological Rhythms 29:318–331. DOI: https://doi.org/10.1177/0748730414546133

      Schendzielorz T, Peters W, Boekhoff I, Stengl M. 2012. Time of Day Changes in Cyclic Nucleotides Are Modified via Octopamine and Pheromone in Antennae of the Madeira Cockroach. Journal of Biological Rhythms 27:388–397. DOI: https://doi.org/10.1177/0748730412456265

      Schendzielorz T, Schirmer K, Stolte P, Stengl M. 2015. Octopamine Regulates Antennal Sensory Neurons via Daytime-Dependent Changes in cAMP and IP3 Levels in the Hawkmoth Manduca sexta. PLOS ONE 10:e0121230. DOI: https://doi.org/10.1371/journal.pone.0121230

      Schneider AC, Schröder K, Chang Y, Nolte A, Gawalek P, Stengl M. 2025. Hawkmoth Pheromone Transduction Involves G-Protein–Dependent Phospholipase Cβ Signaling. eNeuro 12:ENEURO.0376-24.2024. DOI: https://doi.org/10.1523/ENEURO.0376-24.2024, PMID: 39880675

      Stengl M. 2010. Pheromone Transduction in Moths. Frontiers in Cellular Neuroscience 4:133. DOI: https://doi.org/10.3389/fncel.2010.00133

      Stengl M. 1994. Inositol-trisphosphate-dependent calcium currents precede cation currents in insect olfactory receptor neurons in vitro. Journal of Comparative Physiology A 174:187–194. DOI: https://doi.org/10.1007/BF00193785

      Stengl M. 1993. Intracellular-Messenger-Mediated Cation Channels in Cultured Olfactory Receptor Neurons. Journal of Experimental Biology 178:125–147. DOI: https://doi.org/10.1242/jeb.178.1.125

      Stengl M, Funk NW. 2013. The role of the coreceptor Orco in insect olfactory transduction. Journal of Comparative Physiology A 199:897–909. DOI: https://doi.org/10.1007/s00359-013-0837-3

      Stengl M, Hildebrand JG. 1990. Insect olfactory neurons in vitro: morphological and immunocytochemical characterization of male-specific antennal receptor cells from developing antennae of male Manduca sexta. Journal of Neuroscience 10:837–847. DOI: https://doi.org/10.1523/JNEUROSCI.10-03-00837.1990, PMID: 2319305

      Stengl M, Schneider AC. 2024. Contribution of membrane-associated oscillators to biological timing at different timescales. Frontiers in Physiology 14:1243455. DOI: https://doi.org/10.3389/fphys.2023.1243455

      Takagi S, Abuin L, Mermet J, Lee D, Benton R. 2025. A GPCR signaling pathway in insect odor detection. DOI: https://doi.org/10.1101/2025.10.03.680299

      Tanoue S, Krishnan P, Krishnan B, Dryer SE, Hardin PE. 2004. Circadian Clocks in Antennal Neurons Are Necessary and Sufficient for Olfaction Rhythms in Drosophila. Current Biology 14:638–649. DOI: https://doi.org/10.1016/j.cub.2004.04.009, PMID: 15084278

      Thurm U, Küppers J. 1980. Epithelial physiology of insect sensilla. In: Locke M, Smith DS (Eds). Insect Biology in the Future. Academic Press. p. 735–763. DOI: https://doi.org/10.1016/B978-0-12-454340-9.50039-2

      Wicher D, Miazzi F. 2021. Functional properties of insect olfactory receptors: ionotropic receptors and odorant receptors. Cell and Tissue Research 383:7–19. DOI: https://doi.org/10.1007/s00441-020-03363-x

    1. eLife Assessment

      This study addresses an important question in aging biology by combining metabolic, genetic, and functional approaches to examine how cytosolic acetyl-CoA metabolism influences late-life fitness in replicatively aging yeast. The evidence supporting the roles of AMPK activation, mitochondrial acetyl-CoA utilization, and fatty acid synthesis in shaping distinct aging-associated phenotypes is convincing overall, with the engineered A2A strain providing a particularly elegant demonstration of coordinated metabolic regulation. However, several conclusions would benefit from clarification or moderation, particularly regarding the relationship between late-life fitness and replicative lifespan, the interpretation of "senescence," the proposed existence of distinct aging subpopulations, and the extent to which the data support mechanistic claims about lipid starvation, acetyl-CoA excess, and chromatin-based aging pathways.

    2. Reviewer #1 (Public review):

      This rigorous and creative study uses an elegant combination of metabolomics, transcriptomics, and budding yeast molecular genetics to discover that (i) activating AMPK to maintain mitochondrial respiration fueled by cytosolic Acetyl CoA and (ii) increasing fatty acid synthesis independent of respiration drive independent pathways that increase the fitness of replicatively-aged budding yeast cells, albeit without increasing their lifespan. This work will be of interest to scientists in the field of aging and metabolism. Some clarifications in the text would address the following concerns, which would increase the impact of the study:

      (1) What does activation of AMPK (via PGDP-Sak1 expression) do to the replicative lifespan? How many bud scars, in general, do the subpopulations that are older - yet have less Tom70 (increased mitochondrial fitness) - have, after the 48 hrs time point that they are examining? How many divisions occurred in this 48hr time period - i.e. is it long enough to have all cells reach the end of their replicative lifespan? This information is important to rule out that a subset of the mutant cells just divided faster and hence had more divisions within 48 hrs (growing faster and living longer are different things). Having identical growth curves doesn't indicate per se that they all divide at the same rate, as there may be a subpopulation that divides faster and a subpopulation that doesn't grow so well.

      (2) A2A cells do not have an extended replicative lifespan (RLS) but show an increase in the "low senescence" population (Figure 2). If the cells are not becoming senescent, why don't they have longer RLS? Not having a longer lifespan seems inconsistent with the statement that "bud scar counting confirmed that A2A cells reach a higher age than wild type", which comes back to how many times the cells can divide in the 48hr timepoint studied and their rate of cell division? Also, the lifespan curve shown is plotted against time, not cell division number, which does not take into account different division times of cells within the population (described above). It would be much more useful to show standard lifespan curves showing cell division numbers per lifespan per cell.

      (3) Increased "fitness" of the old cells is implied from the increased size of the colonies that the old cells can make. However, this is a measure of the fitness of the daughters per se, not the old mother cells. Are the old mothers just passing on healthier mitochondria and more lipids to the daughters, such that they can divide more times? If the aged cells have an "increased fitness", why don't they divide more times themselves (i.e. live longer?).

      (4) The statement is made that "these experiments define two classes of aging cells with distinct metabolic needs, coherent with the model of two aging trajectories previously proposed (referencing Nan Hao's work)". However, the big difference here is that in Nan Hao's work, their two aging trajectories influenced the length of lifespan, but that does not appear to be the case here. That distinction should be made clear. Perhaps the authors could also speculate as to why the A2A yeast stops dividing after presumably the same number of cell divisions, even though they have an activated AMPK and activated fatty acid synthesis pathway.

      (5) I am a bit confused by the use of the word "senescence" by this lab here and in their previous growth on galactose studies. If yeast don't senesce, which is usually defined as an irreversible arrest of the cell cycle where cells stop dividing, shouldn't the yeast that do not senesce still be dividing and hence have a longer lifespan? Should a different term be used rather than senescence? Such as "fitness late in life". The authors giving their definition of senescence may help reduce this apparent contradiction.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors investigate how cytosolic acetyl-CoA metabolism influences replicative aging in budding yeast. They propose that acetyl-CoA regulates aging through three major pathways: (1) mitochondrial transport to support mitochondrial function, (2) fatty acid synthesis, and (3) global protein acetylation. The data show that AMPK activation promotes mitochondrial import of acetyl-CoA and partially mitigates mitochondrial decline in a subset of aging cells.

      Furthermore, the engineered A2A strain, which enhances mitochondrial acetyl-CoA utilization while relieving inhibition of fatty acid synthesis, increases the proportion of cells exhibiting a "low senescence" phenotype.

      Overall, this is a thoughtful and potentially impactful study that advances our understanding of metabolic control of aging. Addressing the points below, particularly by refining interpretations and, where feasible, incorporating additional analyses, will further strengthen the manuscript and its conclusions.

      Strengths:

      The study has several notable strengths. It addresses an important question by shifting the focus from lifespan to preservation of late-life fitness, which is highly relevant to aging biology. The work integrates metabolic, genetic, and functional analyses to link cytosolic acetyl-CoA flux with distinct aging outcomes, and the engineering of the A2A strain provides a clear and elegant demonstration of how coordinated pathway modulation can improve cellular fitness.

      Weaknesses:

      (1) While the manuscript focuses on mitochondrial transport and fatty acid synthesis, cytosolic acetyl-CoA is also a key regulator of histone acetylation and chromatin silencing. It would strengthen the study to consider whether acetyl-CoA depletion contributes to improved fitness through enhanced rDNA silencing. Given the well-established role of rDNA instability in yeast aging, additional experiments examining rDNA silencing and stability would be valuable. For example, monitoring rDNA copy number changes (not necessarily ERCs) under AMPK activation, oleic acid supplementation, and in the A2A strain, similar to approaches used in the authors' prior work, would help clarify whether chromatin regulation contributes to the observed phenotypes.

      (2) The current data do not fully distinguish whether AMPK activation and oleic acid supplementation act on distinct subpopulations of aging cells. An alternative explanation is that oleic acid supplementation enhances mitochondrial function and acts additively with AMPK activation, thereby increasing the fraction of cells in the "low senescence" state. Since this distinction is not central to the main conclusions, I suggest softening the language around subpopulation specificity. Emphasizing instead that the A2A strain coordinately modulates multiple branches of acetyl-CoA metabolism to improve late-life fitness would maintain the strength of the central message without overinterpretation.

      (3) The manuscript proposes that lipid starvation and excess acetyl-CoA are major drivers of senescence in distinct subpopulations of wild-type aging cells. This conclusion is not yet fully supported by the presented data. Direct measurements of age-dependent divergence in acetyl-CoA and fatty acid levels at the single-cell level would be needed to substantiate this model. Based on the current evidence, a more conservative interpretation would be that aging cells exhibit differential sensitivity to perturbations in acetyl-CoA and lipid metabolism. Accordingly, I recommend revising the statement in the Abstract ("We further implicate lipid starvation and excess acetyl coenzyme A availability as major drivers of senescence...") and the corresponding discussion text to better align with the data.

    4. Reviewer #3 (Public review):

      Summary:

      These findings suggest that PGPD-SAK1 yeast show a subpopulation with lowered TOM70-GFP expression in high bud scar staining aged cells. Deletion of CAT2 or MLS1 reduces this effect. A PGPD-SAK1 acc1S1157A double mutant (called "A2A" here) shows an even larger effect of lowered TOM70-GFP expression in high bud scar staining aged cells. Utilization of various additional mutants involved in acetyl-CoA transport, carnitine shuttle, respiration, etc., leads the authors to conclude that these shifts in TOM70-GFP in aged cells are linked to the AMPK-fatty acid metabolic regulatory system.

      Strengths:

      These extensive and clearly described experiments reveal interesting changes in TOM70-GFP intensity in subsets of aged yeast in several mutants eventually identified as linked to the AMPK-fatty acid metabolic regulatory system.

      Weaknesses:

      (1) Three biological replicates for mRNA-seq is low.

      (2) While "Traditional conceptions of ageing implicate a progressive accumulation of damage leading to systemic degradation in performance until death, with evolutionary pressures acting to maximise early life fitness and fecundity at the expense of ageing health." is tangential perhaps to the data and conclusions of the study, both claims of this sentence are at best controversial, and the manuscript is no weaker for their omission.

      (3) The statement that "Here, we determine the basis of senescence and fitness loss in replicatively ageing yeast" is a bit strong as a summary of the careful work presented here. If the authors had created yeast mutants that retained fitness indefinitely, this would be a more appropriate strength of claim to summarize the work.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      This rigorous and creative study uses an elegant combination of metabolomics, transcriptomics, and budding yeast molecular genetics to discover that (i) activating AMPK to maintain mitochondrial respiration fueled by cytosolic Acetyl CoA and (ii) increasing fatty acid synthesis independent of respiration drive independent pathways that increase the fitness of replicatively-aged budding yeast cells, albeit without increasing their lifespan. This work will be of interest to scientists in the field of aging and metabolism. Some clarifications in the text would address the following concerns, which would increase the impact of the study:

      (1) What does activation of AMPK (via PGDP-Sak1 expression) do to the replicative lifespan? How many bud scars, in general, do the subpopulations that are older - yet have less Tom70 (increased mitochondrial fitness) - have, after the 48 hrs time point that they are examining? How many divisions occurred in this 48hr time period - i.e. is it long enough to have all cells reach the end of their replicative lifespan? This information is important to rule out that a subset of the mutant cells just divided faster and hence had more divisions within 48 hrs (growing faster and living longer are different things). Having identical growth curves doesn't indicate per se that they all divide at the same rate, as there may be a subpopulation that divides faster and a subpopulation that doesn't grow so well.

      Increasing AMPK activity increases replicative lifespan [PMID: 25869125], but given our finding that AMPK activation splits the population, such replicative lifespan assays are hard to interpret. Bud scar counts have a similar issue. Hence we restricted the lifespan and bud scar analyses to wt and A2A, which are more homogeneous (Figures S2 B and E). A2A cells at 48h have ~25% more bud scars than wt cells. Yes, by 48h most of the cells have lost viability (Figure 2E). The reviewer is correct that you can't properly compare the lifespan curves if the cells divide at different rates, hence our follow-up test comparing the viability of wt at 48h with that of A2A at 40h, after we had confirmed that these timepoints captured cells at equivalent replicative ages (Figure 2D,E). This shows that viability of A2A is slightly lower than wt at matched age, indicating a slightly shorter lifespan.

      (2) A2A cells do not have an extended replicative lifespan (RLS) but show an increase in the "low senescence" population (Figure 2). If the cells are not becoming senescent, why don't they have longer RLS? Not having a longer lifespan seems inconsistent with the statement that "bud scar counting confirmed that A2A cells reach a higher age than wild type", which comes back to how many times the cells can divide in the 48hr timepoint studied and their rate of cell division? Also, the lifespan curve shown is plotted against time, not cell division number, which does not take into account different division times of cells within the population (described above). It would be much more useful to show standard lifespan curves showing cell division numbers per lifespan per cell.

      Our observation that cells can reach the end of life without senescing is consistent with other studies that have followed the life course of individual cells by microscopy [PMID: 31291577, 32675375]. These studies always highlight some proportion of the cells that reach the end of life with no or minimal senescence, though this fraction varies with the experimental system. The question of why cells lose viability without senescing is a complete unknown in the field, but reflects a wider lack of consensus as to why yeast lose viability with replicative age.

      We are wary about making strong statements on lifespan for exactly the reason the reviewer picks out. In liquid culture we can only assess viability over time, and it is clear from the comparison of liquid and solid media lifespans performed by the Gottschling lab [PMID: 19652178] that the culture system has a huge effect on lifespan, with cells in classical microdissection-based lifespan assays living far longer than they do in liquid. This of course means that classical microdissection assays are not very useful for A2A, so we are left with an unsatisfactory approximation. We have therefore restricted our conclusion on lifespan to simply say that the lifespan of A2A cells is not extended, which our data in Figures 2D, 2E and S2B do support (see also answer to Q1). Therefore, with the majority of A2A cells showing low senescence marks and high fitness at 48h, we can conclude that lifespan and fitness loss must be separable.

      We will note these limitations of lifespan measurements in the manuscript.

      (3) Increased "fitness" of the old cells is implied from the increased size of the colonies that the old cells can make. However, this is a measure of the fitness of the daughters per se, not the old mother cells. Are the old mothers just passing on healthier mitochondria and more lipids to the daughters, such that they can divide more times? If the aged cells have an "increased fitness", why don't they divide more times themselves (i.e. live longer?).

      Yes, colony growth speed is defined by daughter cell replication, and as long as the daughters and subsequent generations divide at the same rate irrespective of whether they come from a young or old mother, then the size of the colony after 24 hours varies based on the time it took the initial mother to produce a daughter. This is what the assay really measures. We note that aged wildtype mothers often do not divide at all in the first 24 hours after being put on an agar plate (hence the tiny reported colony size), even though they do eventually produce a daughter which then forms a colony, whereas A2A cells tend to produce the first daughter rapidly whether young or old. It is known that daughters of aged wildtype mothers also divide more slowly, which will also contribute to differences in colony size, and this may well result from a lipid and/or mitochondrial contribution, but the primary driver of colony size in 24 hours is the time the mother took to initially divide. We will add this detail to the manuscript.
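
      The reasoning above can be illustrated with a minimal sketch. It is not the authors' analysis: the function name, the 1.5 h doubling time and the lag values are all hypothetical, chosen only to show how a delay in the mother's first division propagates into colony size at 24 hours when every descendant divides at the same rate.

```python
def colony_size(hours_on_plate, mother_lag_h, doubling_time_h=1.5):
    """Approximate cell number in a colony founded by a single mother cell.

    Assumptions (illustrative only, not taken from the manuscript):
    - the mother needs `mother_lag_h` hours to complete its first division;
    - from that moment on, the whole population doubles every
      `doubling_time_h` hours, regardless of the mother's age.
    """
    growth_time = hours_on_plate - mother_lag_h
    if growth_time < 0:
        # The mother has not yet divided: the "colony" is still one cell.
        return 1.0
    # Two cells exist right after the first division, then exponential growth.
    return 2.0 * 2.0 ** (growth_time / doubling_time_h)


if __name__ == "__main__":
    # Hypothetical lags: a mother that divides promptly, one that is delayed,
    # and one that does not divide within the 24 h window.
    for lag in (1.5, 13.5, 30.0):
        print(f"lag {lag:5.1f} h -> {colony_size(24, lag):.0f} cells at 24 h")
```

      Under these assumptions, colony size at a fixed time point depends only on the lag before the first division, which is the quantity the assay is described as measuring.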

      As noted above, the mechanistic basis of lifespan is unknown, but although senescence can shorten lifespan, our work and that of others shows that lifespan is still limited in the absence of senescence.

      (4) The statement is made that "these experiments define two classes of aging cells with distinct metabolic needs, coherent with the model of two aging trajectories previously proposed (referencing Nan Hao's work)". However, the big difference here is that in Nan Hao's work, their two aging trajectories influenced the length of lifespan, but that does not appear to be the case here. That distinction should be made clear. Perhaps the authors could also speculate as to why the A2A yeast stops dividing after presumably the same number of cell divisions, even though they have an activated AMPK and activated fatty acid synthesis pathway.

      We will add this distinction. As noted above, we are wary of making strong statements regarding lifespan as the assays we can do in liquid culture are limited. We are therefore similarly wary about speculating about causes for the lack of lifespan difference because in reality all we can do is rule out a big effect. We would love to speculate on why the A2A cells don't have an extended lifespan, but at present we don't have any good ideas!

      (5) I am a bit confused by the use of the word "senescence" by this lab here and in their previous growth on galactose studies. If yeast don't senesce, which is usually defined as an irreversible arrest of the cell cycle where cells stop dividing, shouldn't the yeast that do not senesce still be dividing and hence have a longer lifespan? Should a different term be used rather than senescence? Such as "fitness late in life". The authors giving their definition of senescence may help reduce this apparent contradiction.

      We completely agree that this is confusing, and we noted this distinction in the Introduction. Use of the term senescence to mean a loss of fitness late in life in yeast stems from the classical definition of senescence as applied to whole organisms. However, the term senescence as applied to cells has a more specific meaning in terms of the cell cycle, as the reviewer notes. As an individual S. cerevisiae is both a cell and an organism, the terminology clashes. However, the marker we largely employ (Tom70-GFP), which in our hands is a very good proxy for fitness, was originally defined as marking the senescence entry point (SEP), so overall we feel we can't avoid the term.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors investigate how cytosolic acetyl-CoA metabolism influences replicative aging in budding yeast. They propose that acetyl-CoA regulates aging through three major pathways: (1) mitochondrial transport to support mitochondrial function, (2) fatty acid synthesis, and (3) global protein acetylation. The data show that AMPK activation promotes mitochondrial import of acetyl-CoA and partially mitigates mitochondrial decline in a subset of aging cells.

      Furthermore, the engineered A2A strain, which enhances mitochondrial acetyl-CoA utilization while relieving inhibition of fatty acid synthesis, increases the proportion of cells exhibiting a "low senescence" phenotype.

      Overall, this is a thoughtful and potentially impactful study that advances our understanding of metabolic control of aging. Addressing the points below, particularly by refining interpretations and, where feasible, incorporating additional analyses, will further strengthen the manuscript and its conclusions.

      Strengths:

      The study has several notable strengths. It addresses an important question by shifting the focus from lifespan to preservation of late-life fitness, which is highly relevant to aging biology. The work integrates metabolic, genetic, and functional analyses to link cytosolic acetyl-CoA flux with distinct aging outcomes, and the engineering of the A2A strain provides a clear and elegant demonstration of how coordinated pathway modulation can improve cellular fitness.

      Weaknesses:

      (1) While the manuscript focuses on mitochondrial transport and fatty acid synthesis, cytosolic acetyl-CoA is also a key regulator of histone acetylation and chromatin silencing. It would strengthen the study to consider whether acetyl-CoA depletion contributes to improved fitness through enhanced rDNA silencing. Given the well-established role of rDNA instability in yeast aging, additional experiments examining rDNA silencing and stability would be valuable. For example, monitoring rDNA copy number changes (not necessarily ERCs) under AMPK activation, oleic acid supplementation, and in the A2A strain, similar to approaches used in the authors' prior work, would help clarify whether chromatin regulation contributes to the observed phenotypes.

      We have data addressing this point that we will add to the manuscript. In short, we see no difference in gene expression from Sir2-repressed sub-telomeric regions or MAT loci, but the genome-wide gene expression dysregulation associated with age is partially suppressed in PGPD-SAK1. However, A2A does not suppress this further, so it is not critical for the suppression of senescence in A2A though we are following this up. ERC accumulation is higher in A2A at 48h, consistent with the cells being older, meaning that ERCs are unlinked to senescence onset as we have previously reported. There is a strong upregulation of transcripts from Sir2-repressed rDNA intergenic spacers with age in all genotypes, but we attribute this simply to the copy number increase of these regions on ERCs rather than a defect in silencing. We have previously looked for heritable changes in rDNA copy number arising during ageing and found (to our surprise) absolutely nothing, so we don't expect any changes under these conditions.

      (2) The current data do not fully distinguish whether AMPK activation and oleic acid supplementation act on distinct subpopulations of aging cells. An alternative explanation is that oleic acid supplementation enhances mitochondrial function and acts additively with AMPK activation, thereby increasing the fraction of cells in the "low senescence" state. Since this distinction is not central to the main conclusions, I suggest softening the language around subpopulation specificity. Emphasizing instead that the A2A strain coordinately modulates multiple branches of acetyl-CoA metabolism to improve late-life fitness would maintain the strength of the central message without overinterpretation.

      We agree that oleic acid and the lipids produced downstream of Acc1 in A2A may improve late life fitness via enhanced mitochondrial function, and in support of this Oxygen Consumption Rate is marginally (though significantly) higher in A2A than PGPD-SAK1. We will add this data to the manuscript. However, we disagree with the interpretation of an additive effect as we report throughout the study that AMPK activation and lipid biosynthesis/supplementation affect different sub-populations of cells. We do not observe populations of intermediate senescence cells, rather by flow cytometry and fitness assays we observe individual cells in binary low senescence or high senescence states.

      (3) The manuscript proposes that lipid starvation and excess acetyl-CoA are major drivers of senescence in distinct subpopulations of wild-type aging cells. This conclusion is not yet fully supported by the presented data. Direct measurements of age-dependent divergence in acetyl-CoA and fatty acid levels at the single-cell level would be needed to substantiate this model. Based on the current evidence, a more conservative interpretation would be that aging cells exhibit differential sensitivity to perturbations in acetyl-CoA and lipid metabolism. Accordingly, I recommend revising the statement in the Abstract ("We further implicate lipid starvation and excess acetyl coenzyme A availability as major drivers of senescence...") and the corresponding discussion text to better align with the data.

      We agree and will adjust the abstract to make it clearer that the lipid starvation / excess acetyl-CoA interpretation is a model.

      Reviewer #3 (Public review):

      Summary:

      These findings suggest that PGPD-SAK1 yeast show a subpopulation with lowered TOM70-GFP expression in high bud scar staining aged cells. Deletion of CAT2 or MLS1 reduces this effect. A PGPD-SAK1 acc1S1157A double mutant (called "A2A" here) shows an even larger effect of lowered tom70 expression in high bud scar staining aged cells. Utilization of various additional mutants involved in acetyl-CoA transport, carnitine shuttle, respiration, etc., leads the authors to conclude that these shifts in TOM70-GFP in aged cells are linked to the AMPK-fatty acid metabolic regulatory system.

      Strengths:

      These extensive and clearly described experiments reveal interesting changes in TOM70-GFP intensity in subsets of aged yeast in several mutants eventually identified as linked to the AMPK-fatty acid metabolic regulatory system.

      Weaknesses:

      (1) 3 biological replicates for mRNASeq is low.

      Thank you for pointing this out. We performed another replicate after posting the initial preprint but didn’t update the figure in the eLife-reviewed version. We will add this to the scatter plots and analysis in Figure 1; the findings have not changed.

      (2) While "Traditional conceptions of ageing implicate a progressive accumulation of damage leading to systemic degradation in performance until death, with evolutionary pressures acting to maximise early life fitness and fecundity at the expense of ageing health." is tangential perhaps to the data and conclusions of the study, both claims of this sentence are at best controversial, and the manuscript is no weaker for their omission.

      We actually feel that this sentence is very important to the message of the manuscript, which is that ageing does not necessarily have to involve a loss of fitness before death. Ageing is often described as the progressive wearing out of components leading to decline and death (with an old car often used as an analogy); in the ageing field this is certainly controversial, but outside the field this remains the normal understanding. We think it is important to state this widely held viewpoint with which our findings are hard to reconcile.

      Our interpretation that yeast are bet-hedging as a population growth strategy, and that this drives ageing in the long term, is a classic case of antagonistic pleiotropy - we will add this term (from the citation that is already in the manuscript) and clarify in the discussion to make it obvious why we are introducing this concept in the introduction.

      (3) The statement that "Here, we determine the basis of senescence and fitness loss in replicatively ageing yeast" is a bit strong as a summary of the careful work presented here. If the authors had created yeast mutants that retained fitness indefinitely, this would be a more appropriate strength of claim to summarize the work.

      Indeed - we will refine this sentence.

    1. eLife Assessment

      This important study provides convincing data suggesting that subcellular localization of the spatial regulator of cell division, MinD, is an intrinsic feature of the protein's ability to associate with the membrane as both a dimer and a monomer. These findings distinguish the behavior of MinD in B. subtilis from its counterpart in E. coli and suggest that there is not a need to invoke additional localization factors. The reviewers felt that the revisions, particularly the additional experiments and changes to the text to make the experimental design and conclusions clearer, improve the quality of the manuscript.

    2. Reviewer #1 (Public review):

      [Editor's note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      In this work the authors investigate the molecular dynamics of MinD, a component of the Bacillus subtilis Min system, in vitro and in vivo. In Escherichia coli the Min system is highly dynamic and displays rapid pole to pole oscillation whereby a time average minimum of the Min proteins at mid cell is established. However, in B. subtilis, this is not the case, and there is no MinE present. MinD in B. subtilis dynamically relocalizes from the poles to division sites, and binds to MinC and MinJ, which mediates its interaction with DivIVA. This paper reports biochemical characterization of B. subtilis MinD in vitro and dynamics of MinD variants in vivo, providing mechanistic insight into the mechanism of dynamic localization.

      Strengths:

      In the current study, the authors perform a detailed biochemical characterization of the in vitro ATPase activity of MinD and demonstrate that rapid hydrolysis is elicited by adding phospholipids. They further show using a collection of substitution mutants of MinD that both monomers and dimers bind to the membrane, and ATP occupancy changes the on and off rates. Identification, quantification, and tracking of discrete Halo-MinD populations was nicely done and showed that mutations in MinD alter dynamic localization, correlating with PL binding on and off rates in vitro.

      In the revised manuscript, the authors now demonstrate localization and tracking data for minC and minJ deletion strains, which suggest that MinJ impacts MinD membrane cycling, but MinC does not. Additional in vitro work showed that the PDZ domain of MinJ modifies MinD ATP hydrolysis rates, and the authors propose that MinJ may promote MinD dimer formation.

      Weaknesses of the revised version: No major weaknesses.

    3. Reviewer #2 (Public review):

      Summary:

      Feddersen & Bramkamp determined important characteristics of how MinD protein binds/dissociates to/from the membrane, and dimerizes in relation to its ATPase activity. The presented data clearly shows the differences in function of MinD homologs from B. subtilis and E. coli.

      Strengths:

      The work presents well-executed experiments that lead to interesting conclusions and a new model of how Min system works during B. subtilis mid-cell division. Importantly, this model is supported by in vitro characterization of well-chosen mutants in the functional domains of MinD. Outstandingly, most of the in vitro data are confirmed by single-molecule localization microscopy.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this work the authors investigate the molecular dynamics of MinD, a component of the Bacillus subtilis Min system, in vitro and in vivo. In Escherichia coli the Min system is highly dynamic and displays rapid pole to pole oscillation whereby a time average minimum of the Min proteins at mid cell is established. However, in B. subtilis, this is not the case, and there is no MinE present. MinD in B. subtilis dynamically relocalizes from the poles to division sites, and binds to MinC and MinJ, which mediates its interaction with DivIVA. This paper reports biochemical characterization of B. subtilis MinD in vitro and dynamics of MinD variants in vivo, providing mechanistic insight into the mechanism of dynamic localization.

      Strengths:

      In the current study, the authors perform a detailed biochemical characterization of the in vitro ATPase activity of MinD and demonstrate that rapid hydrolysis is elicited by adding phospholipids. They further show using a collection of substitution mutants of MinD that both monomers and dimers bind to the membrane, and ATP occupancy changes the on and off rates. Identification, quantification, and tracking of discrete Halo-MinD populations was nicely done and showed that mutations in MinD alter dynamic localization, correlating with PL binding on and off rates in vitro.

      In the revised manuscript, the authors now demonstrate localization and tracking data for minC and minJ deletion strains, which suggest that MinJ impacts MinD membrane cycling, but MinC does not. Additional in vitro work showed that the PDZ domain of MinJ modifies MinD ATP hydrolysis rates, and the authors propose that MinJ may promote MinD dimer formation.

      Weaknesses of the revised version: No major weaknesses.

      We thank this reviewer for the positive evaluation of our manuscript and the precise summary of our findings.

      Reviewer #2 (Public review):

      Summary:

      Feddersen & Bramkamp determined important characteristics of how MinD protein binds/dissociates to/from the membrane, and dimerizes in relation to its ATPase activity. The presented data clearly shows the differences in function of MinD homologs from B. subtilis and E. coli.

      Strengths:

      The work presents well-executed experiments that lead to interesting conclusions and a new model of how Min system works during B. subtilis mid-cell division. Importantly, this model is supported by in vitro characterization of well-chosen mutants in the functional domains of MinD. Outstandingly, most of the in vitro data are confirmed by single-molecule localization microscopy.

      Weaknesses:

      The authors immobilized liposomes, for which they used E. coli total lipids, to measure ATPase activity and liposome association and dissociation of B. subtilis MinD. For these experiments, it would be more suitable to use B. subtilis total lipids, as more biologically relevant data could be gained.

      Although the work is detailed and nicely compares the function of the B. subtilis Min system with the E. coli Min system, it lacks a comparison of Min system function in other rod-shaped Gram-positive bacteria. I would suggest including in the Discussion the complexity of other Min systems. This complexity is especially evident in other rod-shaped spore formers such as clostridial species, in which one or both of these Min systems are present: an oscillating E. coli-type Min system and a more static one as in B. subtilis.

      Comments on revisions:

      I'm satisfied with the authors' response to my private recommendation points. However, I thought that they would also respond to the points raised in the Weaknesses part of my Public Review, as shown above, and update the revised version accordingly.

      We are very grateful to the reviewer for the positive comments and fully agree with the points raised. Due to the overall length of the manuscript, we initially omitted a discussion of the complexity of the Min system in certain Firmicutes. However, we agree that this aspect should be considered. Accordingly, we have now added a dedicated paragraph to the Discussion section addressing this point.

      We also agree that investigating different lipid compositions, including native membranes from Bacillus subtilis, represents a logical next step to further elucidate the influence of lipids on the MinD activity cycle. However, we consider this to constitute a separate project and therefore beyond the scope of the present study.

      Recommendations for the authors:

      Reviewing Editors:

      Some minor corrections are requested. In particular, the addition of a bit more detail to the Discussion about the complexity of Min systems in other bacteria, as suggested by Reviewer 2, would be very much appreciated.

      We thank the editors for their positive assessment and the clear recommendations. We have now added a dedicated paragraph to the Discussion section addressing the complexity of the Min system in Clostridioides.

      Reviewer #1 (Recommendations for the authors):

      The following corrections are requested:

      Abstract - Line 29 - Remove the word "solely" from this statement of the abstract. It would be wise to not be so rigid for a biological system that is only partially characterized and to allow for the possibility that biological factors, including local concentrations and/or other molecules, may yet be discovered to impact MinD activation under certain conditions.

      We agree and have amended the text to avoid a too restrictive statement.

      Line 38 - Remove "do not require any unknown protein component" for the reason stated above. Currently, the experiments recapitulate activation suggesting the membrane binding and release controls dynamics without additional factors. This allows for the possibility that biological factors may yet be shown to impact MinD activation under certain conditions.

      We agree and have changed the text.

      Discussion - Line 526 - Thermus thermophilus is misspelt.

      Corrected.

    1. eLife Assessment

      This important study provides novel information on multi-enzyme complexes, known as metabolons, that form between sequential enzymes in a metabolic pathway. Using an innovative NanoBiT split-luciferase system, the authors present compelling evidence that malate dehydrogenase (MDH1) and citrate synthase (CIT1) dynamically associate under different metabolic conditions in Saccharomyces cerevisiae. The findings suggest the dynamic MDH1-CIT1 interaction facilitates control of TCA pathway flux rate.

    2. Reviewer #1 (Public review):

      Summary:

      The study by the Obata group characterizes the dynamics of the canonical malate dehydrogenase-citrate synthase metabolon in yeast.

      Strengths:

      The study is well-written and appears to give clear demonstrations of this phenomenon.

      Studies of the dynamics of metabolon formation are rare; if the authors can address the concern detailed below, then they have provided such for one of the canonical metabolons in nature.

      Weaknesses:

      There is a fundamental issue with the study, which is that the authors do not provide enough support or information concerning the split luciferase system that they use. Is the binding reversible or not? How the data is interpreted is massively influenced by this fact. What are the pros and cons of this method in comparison to, for example, FLIM-FRET? The authors state that the method is semi-quantitative - can they document this? All of the conclusions are based on the quality of this method. I know that it has been used by others, but at least some preliminary documentation to address these questions is required.

      Comments on revised version:

      I feel that the authors have adequately addressed my prior concerns. I have no further critiques of their work.

    3. Reviewer #2 (Public review):

      This study explores the dynamic association between malate dehydrogenase (MDH1) and citrate synthase (CIT1) in Saccharomyces cerevisiae, with the aim of linking this interaction to respiratory metabolism. Utilizing a NanoBiT split-luciferase system, the authors monitor protein-protein interactions in vivo under various metabolic conditions.

      Major Concerns:

      (1) NanoBiT Signal May Reflect Protein Abundance Rather Than Interaction Strength
      In Figure 1C, the authors report increased MDH1-CIT1 interaction under respiratory (acetate) conditions and decreased interaction during fermentation (glucose), as indicated by NanoBiT luminescence. However, this signal appears to correlate strongly with the expression levels of MDH1 and CIT1, raising the possibility that the observed luminescence reflects protein abundance rather than specific interaction dynamics. To resolve this, NanoBiT signals should be normalized to the expression levels of both proteins to distinguish between abundance-driven and interaction-driven changes.

      (2) Lack of Causal Evidence
      The study presents a series of metabolic perturbation experiments (e.g., arsenite, AOA, antimycin A, malonate) and correlates changes in metabolite levels with NanoBiT signals. However, these data are correlative and do not establish a functional role for the MDH1-CIT1 interaction in metabolic regulation. To demonstrate causality, the authors should implement approaches to specifically disrupt the MDH1-CIT1 interaction. One strategy could involve using a 15-residue peptide (Pept1) derived from the Pro354-Pro366 region of CIT1, previously shown to mediate the interaction or introducing the cit1Δ3 (Arg362Glu) mutation, which perturbs binding. Metabolic flux analysis using ^13C-labeled glucose and mitochondrial respiration assays (e.g., Seahorse) could then assess functional consequences.

      (3) Absence of Protein Expression Controls Under Perturbation Conditions
      In experiments involving acetate, arsenite, AOA, antimycin A, and malonate, the authors infer changes in MDH1-CIT1 association based solely on NanoBiT signals. However, no accompanying data are provided on MDH1 and CIT1 protein levels under these conditions. This omission weakens the conclusions, as altered expression rather than interaction strength could underlie the observed luminescence changes. Immunoblotting or quantitative proteomics should be used to confirm constant protein expression across conditions.
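
      The normalization requested in point (1) could be sketched roughly as follows. This is only one possible scheme, not something specified by the reviewer or the authors: dividing by the geometric mean of the two abundance readouts is an assumption, the function names are hypothetical, and the numbers in the usage example are invented purely for illustration. Which denominator is appropriate depends on the binding regime.

```python
import math


def normalized_interaction(nanobit_signal, mdh1_abundance, cit1_abundance):
    """Scale a NanoBiT luminescence reading by the abundance of both partners.

    Assumption (illustrative): abundance is summarised by the geometric mean
    of the two protein-level readouts, so the result keeps the units of
    "signal per unit of protein".
    """
    denominator = math.sqrt(mdh1_abundance * cit1_abundance)
    if denominator <= 0:
        raise ValueError("abundance readouts must be positive")
    return nanobit_signal / denominator


def fold_change(condition, reference):
    """Ratio of normalized interaction between a condition and a reference.

    Each argument is a (signal, mdh1_abundance, cit1_abundance) tuple.
    """
    return normalized_interaction(*condition) / normalized_interaction(*reference)


if __name__ == "__main__":
    # Invented readouts in arbitrary units, for illustration only.
    reference = (100.0, 10.0, 10.0)
    abundance_driven = (400.0, 40.0, 40.0)    # signal rises with protein level
    interaction_driven = (400.0, 20.0, 20.0)  # signal rises more than protein
    print(fold_change(abundance_driven, reference))
    print(fold_change(interaction_driven, reference))
```

      A fold change close to 1 after normalization would point to an abundance-driven change in raw luminescence, whereas a value clearly above or below 1 would be more consistent with a change in the interaction itself.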

      Conclusion:

      Although the central question is compelling and the use of NanoBiT in live cells is a strength, the manuscript requires additional experimental rigor. Specifically, normalization of interaction signals, introduction of causative perturbations, and validation of protein expression are essential to substantiate the study's claims.

      Comments on revised version:

      The manuscript is much improved.

    4. Reviewer #3 (Public review):

      Summary:

      Metabolons are multisubunit complexes that promote the physical association of sequential enzymes within a metabolic pathway. Such complexes are proposed to increase metabolic flux and efficiency by channeling reaction intermediates between enzymes. The TCA cycle enzymes malate dehydrogenase (MDH1) and citrate synthase (CIT1) have been linked to metabolon formation, yet the conditions under which these enzymes interact, and whether such interactions are dynamic in response to metabolic cues, remains unclear, particularly in the native cellular context. This study uses a nanoBIT protein-protein interaction assay to map the dynamic behavior of the MDH1-CIT1 interaction in response to multiple metabolic stimuli and challenges in yeast. Beyond mapping these interactions in real time, the authors also performed GC-MS metabolomics to map whole cell metabolite alterations across experimental conditions. Finally, the authors use microscale thermophoresis to determine components that alter the MDH1-CIT1 interaction in vitro. Collectively, the authors synthesize their collected data into a model in which the MDH1-CIT1 metabolon dissociates in conditions of low respiratory flux, and is stimulated during conditions of high respiratory flux. While their data largely support these models, some key exceptions are found that suggest this model is likely oversimplified and will require further work to understand the complexities associated with MDH1-CIT1 interaction dynamics. Nonetheless, the authors put forth an interesting and timely toolkit to begin to understand the interaction kinetics and dynamics of key metabolic enzymes that should serve as a platform to begin disentangling these important yet understudied aspects of metabolic regulation.

      Strengths:

      - The authors address an important question: how do metabolon-associated protein-protein interactions change across altered metabolic conditions?

      - The development and validation of the MDH1-CIT1 nanoBIT assay provides an important tool to allow the quantification of this protein-protein interaction in vivo. Importantly, the authors demonstrate that the assay allows kinetic and real time assessment of these protein interactions, which reveal interesting and dynamic behavior across conditions.

      - The use of classic biochemical techniques to confirm that pH and various metabolites can alter the MDH1-CIT1 interaction in vitro is rigorous and supports the model put forth by the authors.

      Weaknesses:

      The authors have addressed identified weaknesses within the revision of their manuscript.

    5. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study reports a dynamic association/dissociation between malate dehydrogenase (MDH1) and citrate synthase (CIT1) in Saccharomyces cerevisiae under different metabolic conditions that control TCA pathway flux rate. The research question is timely, the use of the NanoBiT split-luciferase system to monitor protein-protein interactions is innovative, and the significance of the findings is valuable. However, the strength of evidence needed to support the conclusions was found to be incomplete based on a lack of critical control and mechanistic experiments.

      We thank the editor for this thoughtful assessment of our work. We are encouraged that the research question, experimental approach, and overall significance were viewed positively.

      To address the concern regarding the strength of evidence, we have implemented additional controls in the revised manuscript. Specifically, we have repeated all MDH1-CIT1 interaction measurements alongside strains expressing full-length NanoLUC fusion proteins to assess MDH1 and CIT1 protein abundance. The resulting data, now included as supplementary figures (Figure 2 – figure supplement 2, Figure 2 – figure supplement 3, Figure 3 – figure supplement 1, Figure 4 – figure supplement 2), demonstrate the reproducibility of the findings and indicate that the observed changes in MDH1-CIT1 interaction are not attributable to protein abundance variations.

      We agree that a detailed mechanistic dissection of how the MDH1–CIT1 complex influences metabolic pathway flux is an essential piece of evidence for establishing the functions of the metabolon. However, such analyses require extensive additional investigation beyond the scope of the present study. Accordingly, we have clarified the aims of this work in the revised manuscript to emphasize that our primary objective is to characterize the dynamic behavior of the MDH1–CIT1 interaction under different metabolic conditions and to identify key factors associated with its regulation.

      We believe these revisions strengthen the rigor of the study, better define its scope, and provide a solid foundation for future mechanistic investigations.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by the Obata group characterizes the dynamics of the canonical malate dehydrogenase-citrate synthase metabolon in yeast.

      Strengths:

      The study is well-written and appears to give clear demonstrations of this phenomenon.

      Studies of the dynamics of metabolon formation are rare; if the authors can address the concern detailed below, then they have provided such for one of the canonical metabolons in nature.

      We sincerely thank the reviewer for their positive assessment and for recognizing the value of our study in characterizing the dynamics of the MDH1-CIT1 metabolon. We appreciate the recognition that studies of metabolon dynamics are rare and that our work provides a clear demonstration of this phenomenon for a canonical metabolon. We have carefully addressed the methodological concerns regarding the NanoBiT system as detailed below to further strengthen the evidence for our findings.

      Weaknesses:

      There is a fundamental issue with the study, which is that the authors do not provide enough support or information concerning the split luciferase system that they use.

      We agree that a detailed description of the NanoBiT system is essential to ensure the reliability of the methodology. As suggested, we have added a dedicated paragraph to the Introduction (Lines 90–103) to clarify these technical aspects, supported by the foundational work of Dixon et al. (2016).

      Is the binding reversible or not? How the data is interpreted is massively influenced by this fact.

      Yes, the NanoBiT system is specifically designed to be reversible. The intrinsic affinity of the subunits is low (K<sub>D</sub> = 190 μM), and the association and dissociation rate constants (k<sub>on</sub> = 500 M<sup>-1</sup>s<sup>-1</sup>, k<sub>off</sub> = 0.2 s<sup>-1</sup>) are well outside the range of typical protein-protein interactions (Dixon et al., 2016). These kinetics ensure that the assembly and disassembly of the luminescent complex are dictated solely by the interaction characteristics of the target proteins (MDH1 and CIT1) and not by the tags themselves. This allows for real-time monitoring of both the association and dissociation phases.
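
      As a rough illustration of what these constants imply, the sketch below computes the dissociation half-life from the quoted k<sub>off</sub> and the equilibrium occupancy expected from the tags alone under simple 1:1 binding. The 1 μM free-partner concentration is a hypothetical value chosen only for illustration, not a measurement from this study, and the function names are placeholders.

```python
import math

K_D_MOLAR = 190e-6   # intrinsic affinity of the NanoBiT subunits quoted above
K_OFF_PER_S = 0.2    # dissociation rate constant quoted above


def dissociation_half_life(k_off_per_s: float) -> float:
    """Half-life (s) of a complex that decays with first-order rate k_off."""
    return math.log(2) / k_off_per_s


def fraction_bound(free_partner_molar: float, k_d_molar: float) -> float:
    """Equilibrium occupancy for simple 1:1 binding with the partner in excess."""
    return free_partner_molar / (free_partner_molar + k_d_molar)


if __name__ == "__main__":
    # Hypothetical free tag concentration (an assumption, for illustration only).
    assumed_partner_molar = 1e-6
    half_life = dissociation_half_life(K_OFF_PER_S)
    occupancy = fraction_bound(assumed_partner_molar, K_D_MOLAR)
    print(f"half-life of the tag-only complex: {half_life:.2f} s")
    print(f"tag-driven occupancy at 1 uM partner: {occupancy:.4f}")
```

      Under these assumptions the tag-only complex would decay within a few seconds and occupy well under 1% of the tagged protein, which is the sense in which the tags themselves should contribute little to the signal.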

      What are the pros and cons of this method in comparison to, for example, FLIM-FRET?

      We have now explicitly addressed the pros and cons of our methodology compared to fluorescence-based systems:

      Pros: The NanoLUC-based reporter is 150 times brighter than conventional luciferases and has a significantly higher dynamic range (Hall et al., 2012), allowing detection of weak transient interactions. Importantly for this study, fluorescence-based methods such as FLIM-FRET and BRET are difficult to implement in yeast microplate assays due to the high levels of cellular autofluorescence. NanoBiT bypasses this issue, providing a high signal-to-noise ratio.

      Cons: Unlike FRET, NanoBiT requires the application of a substrate (furimazine). We did not include this disadvantage in the manuscript because it is not critical in a yeast study. Furimazine can be applied directly to the medium and readily permeates cells.

      The authors state that the method is semi-quantitative - can they document this?

      The semi-quantitative nature of the system is supported by its high dynamic range and the linear relationship between the luminescence signal and the amount of protein complex formed, as documented in Dixon et al. (2016). By using this system in a microplate setting, we were able to monitor relative increases or decreases in interaction levels over time across multiple metabolic conditions, providing a robust comparative analysis of metabolon dynamics.

      All of the conclusions are based on the quality of this method. I know that it has been used by others, but at least some preliminary documentation to address these questions is required.

      We acknowledge the reviewer’s concern regarding the reliance on the NanoBiT system. To ensure the reliability of our conclusions, we have included several lines of evidence to validate the method and demonstrate that the observed luminescence signals accurately reflect protein-protein interaction dynamics.

      To confirm the NanoBiT results using an independent biochemical approach, we performed an in vivo pull-down assay following glucose addition (Figure 2 – figure supplement 1A). The results demonstrate a reduction in the physical association between MDH1 and CIT1. This biochemical validation directly supports the reduction in interaction observed with the NanoBiT system during the Crabtree effect.

      We have provided protein abundance data for both MDH1 and CIT1 across the experimental conditions (Figure 2 – figure supplement 1&3; Figure 3 – figure supplement 1; Figure 4 – figure supplement 2). These results show only minor changes in protein levels, confirming that the fluctuations in the NanoBiT signal are independent of protein expression and represent genuine changes in metabolon assembly.

      To ensure the findings are reproducible, we have included MDH1-CIT1 interaction results from repeated independent experiments (Figure 2 – figure supplement 1&3; Figure 3 – figure supplement 1; Figure 4 – figure supplement 1). The consistency of the results across these trials confirms the robustness of the system in monitoring the metabolic regulation of this complex.

      We hope that these additional experimental validations, alongside the detailed technical description based on the established properties of the NanoBiT system (Dixon et al., 2016; Hall et al., 2012), provide the necessary documentation to satisfy the reviewer’s concerns regarding the quality and reliability of the method.

      Reviewer #2 (Public review):

      This study explores the dynamic association between malate dehydrogenase (MDH1) and citrate synthase (CIT1) in Saccharomyces cerevisiae, with the aim of linking this interaction to respiratory metabolism. Utilizing a NanoBiT split-luciferase system, the authors monitor protein-protein interactions in vivo under various metabolic conditions.

      Major Concerns:

      (1) NanoBiT Signal May Reflect Protein Abundance Rather Than Interaction Strength

      In Figure 1C, the authors report increased MDH1-CIT1 interaction under respiratory (acetate) conditions and decreased interaction during fermentation (glucose), as indicated by NanoBiT luminescence. However, this signal appears to correlate strongly with the expression levels of MDH1 and CIT1, raising the possibility that the observed luminescence reflects protein abundance rather than specific interaction dynamics. To resolve this, NanoBiT signals should be normalized to the expression levels of both proteins to distinguish between abundance-driven and interaction-driven changes.

      We agree that distinguishing between abundance-driven and interaction-driven changes is vital. To address this, we have included new data showing the relative protein levels of MDH1 and CIT1 across all experimental conditions. The protein levels were assessed using yeast lines expressing these proteins tagged with full-length NanoLUC luciferase (Figure 2 – figure supplement 1&3, Figure 3 - figure supplement 1, Figure 4 – figure supplement 2). Using the luminescence data of these relative protein levels, we have included plots showing a normalized interaction index (Figure 2 – figure supplement 1G & 3D,H,L; Figure 3 - figure supplement 1D,H,L,P; Figure 4 – figure supplement 1D,H,L). This index was calculated by dividing the NanoBiT interaction signal by the product of the relative abundances of both proteins:

      Normalized interaction index = NanoBiT / (MDH1 × CIT1)

      In this formula, NanoBiT, MDH1, and CIT1 are the relative luminescence levels at each time point. This analysis clarified that the changes in the interaction signal significantly exceeded the fluctuations in protein levels, confirming that the dynamics are interaction-specific and not abundance-driven. To provide the most direct and transparent representation of the experimental measurements, we have chosen to keep the raw RLU data in the main figures and have moved the data related to protein abundance and normalization to figure supplements.
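
      The normalization described above (NanoBiT signal divided by the product of the two relative abundances, per time point) can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function name and the example values are made up, and it assumes that all three series are already expressed as relative luminescence at matching time points.

```python
from typing import List, Sequence


def normalized_interaction_index(
    nanobit: Sequence[float],
    mdh1: Sequence[float],
    cit1: Sequence[float],
) -> List[float]:
    """Divide the NanoBiT signal by the product of both relative abundances.

    All three inputs are relative luminescence values at matching time points.
    """
    if not (len(nanobit) == len(mdh1) == len(cit1)):
        raise ValueError("all series must cover the same time points")
    return [n / (m * c) for n, m, c in zip(nanobit, mdh1, cit1)]


if __name__ == "__main__":
    # Made-up relative values (1.0 = level at the first time point).
    nanobit = [1.0, 0.5, 0.25]
    mdh1 = [1.0, 1.0, 0.8]
    cit1 = [1.0, 1.25, 1.0]
    print(normalized_interaction_index(nanobit, mdh1, cit1))
```

      In this toy series the interaction signal falls fourfold while the abundances change by at most 25%, so the index still drops substantially, which is the pattern the normalization is meant to expose.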

      (2) Lack of Causal Evidence

      The study presents a series of metabolic perturbation experiments (e.g., arsenite, AOA, antimycin A, malonate) and correlates changes in metabolite levels with NanoBiT signals. However, these data are correlative and do not establish a functional role for the MDH1-CIT1 interaction in metabolic regulation. To demonstrate causality, the authors should implement approaches to specifically disrupt the MDH1-CIT1 interaction. One strategy could involve using a 15-residue peptide (Pept1) derived from the Pro354-Pro366 region of CIT1, previously shown to mediate the interaction, or introducing the cit1Δ3 (Arg362Glu) mutation, which perturbs binding. Metabolic flux analysis using ^13C-labeled glucose and mitochondrial respiration assays (e.g., Seahorse) could then assess functional consequences.

      We agree with the reviewer that the current dataset correlates metabolon assembly with metabolic states rather than establishing a direct causal proof of its functional role in regulating pathway flux.

      However, the primary objective of this manuscript was to establish the dynamic nature of the MDH1-CIT1 metabolon and to demonstrate the causal relationship between the changes in cellular conditions and metabolon dynamics through in vitro and in vivo assessments. Demonstrating that this canonical multienzyme complex undergoes reversible assembly and disassembly in vivo represents a major advance, as metabolon dynamics is a critical, yet previously unrevealed, factor involved in metabolic regulation. We aimed to define the specific environmental triggers that govern these dynamics, providing the necessary foundation for defining the functions of metabolons.

      We completely agree that establishing causality using interaction-deficient mutants coupled with metabolic flux analysis is another critical experiment to establish the functions of the TCA cycle metabolon. We have, in fact, been conducting these precise metabolic flux analyses on CIT1 mutants with disrupted interaction with MDH1. Because the functional consequences of complex disruption involve wide-reaching metabolic rerouting that requires extensive data presentation and modeling, this work forms a separate, comprehensive follow-up study that is currently in preparation for submission in the near future.

      To address this limitation in the current manuscript, we have carefully reviewed and revised the Abstract, Results, Discussion, and Conclusion sections (Lines 19-22; 205; 322-327; 341-342; 458-466). We have removed any language that may have inadvertently implied direct causality. We now explicitly state that our findings indicate the relationship between metabolon dynamics and respiratory conditions, and we have added a clear statement noting that the direct effects of this assembly on metabolic flux are the focus of our forthcoming studies.

      (3) Absence of Protein Expression Controls Under Perturbation Conditions

      In experiments involving acetate, arsenite, AOA, antimycin A, and malonate, the authors infer changes in MDH1-CIT1 association based solely on NanoBiT signals. However, no accompanying data are provided on MDH1 and CIT1 protein levels under these conditions. This omission weakens the conclusions, as altered expression rather than interaction strength could underlie the observed luminescence changes. Immunoblotting or quantitative proteomics should be used to confirm constant protein expression across conditions.

      As described in our response to your first concern, we have now performed protein expression assessments for all experiments, including the perturbation conditions, such as acetate, arsenite, AOA (Figure 3 – figure supplement 1), antimycin A, cyanide, and malonate (Figure 4 – figure supplement 2). The results demonstrate that the protein levels of MDH1 and CIT1 remain relatively stable throughout these treatments and do not correlate with the large changes observed in the interaction signals. This is also demonstrated by the normalized interaction index, which confirms that the shifts in luminescence are driven by the dynamic assembly and disassembly of the MDH1-CIT1 metabolon rather than changes in protein concentrations.

      Conclusion:

      Although the central question is compelling and the use of NanoBiT in live cells is a strength, the manuscript requires additional experimental rigor. Specifically, normalization of interaction signals, introduction of causative perturbations, and validation of protein expression are essential to substantiate the study's claims.

      We sincerely thank the reviewer for recognizing the value of our central question and the strength of the live-cell NanoBiT system, as well as for your rigorous critique that has strengthened this manuscript. To address the concerns regarding experimental rigor, we have now provided extensive validation of MDH1 and CIT1 protein expression across all experimental conditions using yeast lines tagged with the full-length NanoLUC luciferase. These data demonstrate relatively stable protein expression, allowing us to calculate a normalized interaction index that substantiates that the observed luminescence shifts are driven by dynamic metabolon assembly rather than protein concentration. Regarding causative perturbations, we agree that introducing interaction-deficient mutants coupled with isotopic flux analysis is the critical next step to establish functional consequences. Because defining these pathway-wide rerouting events requires extensive modeling, this work will be reported in a follow-up study currently in preparation. Accordingly, we have carefully revised the manuscript to remove language implying direct causality, explicitly framing metabolon dynamics as an integral factor in metabolic regulation closely related to pathway activity and cellular metabolic states. We believe these new quantitative controls, normalizations, and textual clarifications thoroughly address the need for additional rigor and solidly substantiate our findings.

      Reviewer #3 (Public review):

      Summary:

      Metabolons are multisubunit complexes that promote the physical association of sequential enzymes within a metabolic pathway. Such complexes are proposed to increase metabolic flux and efficiency by channeling reaction intermediates between enzymes. The TCA cycle enzymes malate dehydrogenase (MDH1) and citrate synthase (CIT1) have been linked to metabolon formation, yet the conditions under which these enzymes interact, and whether such interactions are dynamic in response to metabolic cues, remain unclear, particularly in the native cellular context. This study uses a nanoBIT protein-protein interaction assay to map the dynamic behavior of the MDH1-CIT1 interaction in response to multiple metabolic stimuli and challenges in yeast. Beyond mapping these interactions in real time, the authors also performed GC-MS metabolomics to map whole-cell metabolite alterations across experimental conditions. Finally, the authors use microscale thermophoresis to determine components that alter the MDH1-CIT1 interaction in vitro. Collectively, the authors synthesize their collected data into a model in which the MDH1-CIT1 metabolon dissociates in conditions of low respiratory flux, and is stimulated during conditions of high respiratory flux. While their data largely support these models, some key exceptions are found that suggest this model is likely oversimplified and will require further work to understand the complexities associated with MDH1-CIT1 interaction dynamics. Nonetheless, the authors put forth an interesting and timely toolkit to begin to understand the interaction kinetics and dynamics of key metabolic enzymes that should serve as a platform to begin disentangling these important yet understudied aspects of metabolic regulation.

      We thank the reviewer for this thoughtful and constructive summary of our work. We appreciate the recognition of the novelty and utility of our experimental approach and the integrated analysis of MDH1–CIT1 interaction dynamics.

      We agree with the reviewer that, although our data largely support a model in which MDH1–CIT1 interaction correlates with respiratory activity, there are conditions that do not fully conform to this simplified framework. In the revised manuscript, we have addressed these apparent inconsistencies by providing detailed interpretations of the counterintuitive observations (e.g., ETC inhibition) and emphasizing that the MDH1–CIT1 interaction is modulated by changes in the mitochondrial matrix microenvironment associated with respiratory activity.

      Furthermore, we have revised the Discussion to highlight that the regulation of the MDH1–CIT1 interaction is likely multifactorial, involving the combined effects of pH, metabolites, and other unknown factors, which together enable fine-tuning of metabolic flux in fluctuating environments. This expanded perspective is now stated more clearly.

      We agree that identifying the precise molecular determinants of MDH1–CIT1 interaction dynamics will require additional mechanistic studies, such as systematic analyses using yeast mutants. While these experiments are an important next step, they are beyond the scope of the present study. We anticipate that the toolkit and framework established here will facilitate such future investigations.

      Strengths:

      (1) The authors address an important question: how do metabolon-associated protein-protein interactions change across altered metabolic conditions?

      (2) The development and validation of the MDH1-CIT1 nanoBIT assay provides an important tool to allow the quantification of this protein-protein interaction in vivo. Importantly, the authors demonstrate that the assay allows kinetic and real time assessment of these protein interactions, which reveal interesting and dynamic behavior across conditions.

      (3) The use of classic biochemical techniques to confirm that pH and various metabolites can alter the MDH1-CIT1 interaction in vitro is rigorous and supports the model put forth by the authors.

      We thank the reviewer for these positive and encouraging comments. We are pleased that the importance of the research question, the development of the MDH1–CIT1 NanoBiT assay, and the integration of in vivo and in vitro approaches were recognized. We especially appreciate the acknowledgment of the assay’s ability to capture dynamic and kinetic changes in protein–protein interactions, as well as the support provided by the biochemical analyses. We hope that the experimental framework established in this study will serve as a useful platform for further investigations into metabolon dynamics and metabolic regulation.

      Weaknesses:

      (1) Some of the data collected seem to be merely reported rather than synthesized and interpreted for the reader.

      We agree that explicitly synthesizing these findings is essential for clarity. To improve this, we have revised the Results section to include concise summary statements at the conclusion of each major experimental paragraph (Lines 190-191, 201, 218-219, 229-231, 241-242, 272-274, 282-283; 291-293). These additions interpret the data in relation to our main hypothesis. The discussion section was thoroughly revised to more precisely explain the logic supporting the model (Lines 381-393; 433-443, 458-466). Additionally, to bring together the entire dataset, we introduced a new summary schematic (Figure 6A). This figure visually and conceptually integrates our diverse findings, covering metabolic treatments, pH fluctuations, and complex metabolite profiles, showing how these signals work together to control multienzyme complex assembly.

      This is particularly true for data that seem to reflect more complex trends, such as the GC-MS experiments that map metabolites across multiple experiments, or treatments that show somewhat counterintuitive results, such as the antimycin A treatment, which promotes rather than disrupts the MDH1-CIT1 interaction.

      We agree that our complex datasets, including the metabolomics and the seemingly counterintuitive Antimycin A results, required deeper synthesis. To clarify the broader metabolic trends, we have added Figure 6A to visually map which factors, specifically pH, malate, fumarate, and aspartate, most consistently align with complex assembly. We revised the Discussion (Lines 390-393, 439-443) to explicitly conclude that no single variable predominantly governs the interaction, but it is coordinately regulated by multiple microenvironmental cues.

      Regarding the Antimycin A (and other ETC inhibitors) discrepancy, where the interaction is enhanced despite suppressed respiration, we have expanded our interpretation (Lines 346–358) to explain this as a transient response that is not directly reflected by steady-state respiratory activity. Specifically, we propose that acute perturbations of the mitochondrial matrix microenvironment, particularly changes in pH, temporarily promote MDH1–CIT1 interaction. Thus, under these conditions, transient microenvironmental changes can dominate over steady-state respiratory output in regulating metabolon assembly.

      The discussion paragraph about the imperfect relationship between pH and interaction has been revised to highlight our conclusion that mitochondrial matrix pH can be a contributing factor rather than the primary regulator (Lines 386-393).

      (2) Some of the assertions put forth in the manuscript are not substantiated by the data presented, and the authors are at times overly reliant on previous findings from the literature to support their claims. This is particularly notable for claims about "TCA cycle flux"; the authors do not perform flux analysis anywhere in their study and should be cautious when insinuating correlations between their observations and "flux".

      We appreciate the reviewer’s careful evaluation of our terminology and fully agree that claims regarding "flux" should be reserved for studies that employ direct isotopic flux measurements. In response to this constructive feedback, we have thoroughly reviewed the manuscript to ensure that our assertions are substantiated by the presented experimental data. We have carefully evaluated the use of the term "flux" throughout the Abstract, Introduction, and Discussion, replacing it with more accurate phrases such as "pathway activity," "respiratory activity," or "mitochondrial respiration" depending on the specific context (Lines 11; 20-21; 50; 111-112; 322-327; 329; 345; 349-350; 442-443; 458-466).

      We also removed a paragraph discussing the potential role of the MDH1-CIT1 metabolon in the malate-aspartate shuttle (Line 361). We realized the paragraph is highly speculative, and our data do not directly support the hypothesis. The influence of the MDH1-CIT1 metabolon on the malate-aspartate shuttle is a major finding of the upcoming manuscript reporting its effects on metabolic network flux. We apologize for mixing up the results of two separate studies.

      Furthermore, we have revised our conclusions to avoid over-reliance on prior literature in making causal claims. We now explicitly frame the dynamic assembly of the MDH1-CIT1 metabolon as an integral factor in metabolic regulation, closely related to cellular metabolic states, rather than stating that it controls pathway flux (Lines 454-462). We believe these textual revisions accurately align our claims with our current observations and remove any unsubstantiated assertions.

      (3) The manuscript presentation could be improved. For figures, at times, the axes do not have intuitive labels (example, Figure 1A), data points and details about the number of samples analyzed are missing (bar graphs and box plots), and molecular weight markers are not reported on western blots. The authors refer to the figures out of order in the text, which makes the manuscript challenging to navigate as a reader.

      We thank the reviewer for these helpful suggestions to improve the clarity and presentation of the manuscript. We have made several revisions accordingly.

      First, axis labels have been revised throughout the figures to improve clarity and make them more intuitive. Second, we have added the number of biological replicates to the figure captions and updated bar graphs and box plots to display individual data points. Third, to improve the transparency of the immunoblot data, we have included molecular weight marker positions in Figure 1C and the corresponding full gel images in a new Figure 1 – figure supplement 2. Other immunoblot images have been moved to Figure 2 – figure supplement 1 since they lack molecular weight marker images.

      In addition, we have reorganized the figure panel labeling and corresponding text to improve the flow of the Results section. Specifically, figure subpanels are now arranged according to the measured parameters rather than treatment conditions, and the relevant sections describing TCA cycle manipulation and ETC inhibition have been revised to follow this updated figure order (Lines 208–231; 251–274). These changes improve the readability and logical progression of the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The grammar in the abstract in the sentence which states called metabolon. This needs to be fixed.

      We thank the reviewer for pointing this out. We have revised the sentence in the Abstract to improve clarity. The revised sentence reads: “The tricarboxylic acid (TCA) cycle enzymes malate dehydrogenase (MDH1) and citrate synthase (CIT1) form a multienzyme complex, referred to as a metabolon, that channels intermediate oxaloacetate between their reaction centers.” (Lines 7-9)

      Reviewer #3 (Recommendations for the authors):

      Major points:

      (1) Much of the data reported in this manuscript reads as a summary of what was found, rather than distilling what the trends in the data mean or how they support the proposed model.

      We thank the reviewer for this comment. This concern overlaps with your previous point (Weakness 1), which we have addressed through revisions to improve synthesis and clarity. Specifically, we have added concise summary statements at the end of each major experimental section (Lines 190-191, 201, 218-219, 229-231, 241-242, 272-274, 282-283; 291-293), and we have included a new summary schematic (Figure 6A) that integrates the findings to illustrate how metabolic conditions and mitochondrial microenvironments relate to MDH1–CIT1 interaction. Together, these revisions improve the interpretation and clarify how the results support our model.

      For instance, in Figure 3, the authors use one metabolic treatment to activate the TCA cycle and two to inhibit the TCA cycle. In Figure 3M, GC-MS data are reported for select metabolites across these three conditions, as well as a control condition. However, these metabolites don't follow clean "trends" according to the predictions; as one example, malate is down in the TCA active (acetate) and one TCA inhibited condition (arsenite), whereas it is elevated in the second TCA inhibited (aminooxyacetate) condition. As an additional example, glutamate is down in the arsenite (inhibited) condition, slightly down in the acetate (activated) condition, but is unchanged in the AOA (inhibited) condition. Similar variability is seen in Figure 4M. What do these discrepancies mean? How do they support the model? As written, these data bring forth more questions than they answer.

      We appreciate the reviewer’s careful analysis of the metabolomics data in Figures 2E, 3M, and 4M. The reviewer notes that the levels of certain metabolites show complex patterns that do not simply reflect overall TCA cycle activity. We have acknowledged that our metabolomics dataset is a valuable resource for the research community and have added a brief paragraph to emphasize the complex metabolic phenotypes resulting from chemical treatments (Lines 422-431).

      As mentioned in the paragraph, this complexity is biologically expected. It likely arises from the distinct primary targets of each inhibitor, such as arsenite affecting redox-sensitive enzymes and AOA disrupting the malate-aspartate shuttle, as well as off-target effects and the adaptive reorganization of intersecting metabolic networks to bypass local blockades. Rather than viewing these diverse metabolic phenotypes as discrepancies, we leveraged them to uncouple general respiratory suppression from specific metabolite pools, allowing us to independently assess their relationship with metabolon assembly.

      Furthermore, we note that our GC-MS analysis measures whole-cell metabolite levels, which represent the sum of multiple subcellular compartments and may not precisely reflect localized concentrations within the mitochondrial matrix that is directly affected by the TCA cycle. The description of this limitation of whole-cell metabolomics has been revised in Lines 417-420.

      (2) Why do the authors propose that antimycin A increases the interaction between MDH1 and CIT1 despite decreasing respiratory activity? Given the generalities proposed in Figure 6, this is important to address.

      We thank the reviewer for this comment. This point overlaps with Weakness 1, where we have addressed the apparent discrepancy associated with antimycin A (and other ETC inhibitors). Briefly, we have expanded our interpretation (Lines 349–360) to explain this effect as a transient response that is not directly aligned with steady-state respiratory activity. We propose that acute perturbations of the mitochondrial matrix microenvironment, particularly changes in pH, temporarily promote MDH1–CIT1 interaction. In addition, we have revised the Discussion (Lines 386–404) to clarify that mitochondrial matrix pH acts as a contributing factor rather than the primary regulator of the interaction. Together, these revisions reconcile the ETC inhibition by antimycin A with the overall model presented in Figure 6.

      (3) The authors use acetate to "activate" the TCA cycle; do other non-fermentable carbon sources also promote the MDH1-CIT1 interaction?

      We thank the reviewer for this insightful question. We have tested additional non-fermentable carbon sources and found that they did not significantly affect MDH1–CIT1 interaction (Figure 3—figure supplement 1). We note that raffinose present in the medium likely provides a baseline carbon source supporting oxidative metabolism, which may limit the observable effects of these treatments (Lines 149-150).

      In addition, we performed a new experiment using ethanol. While ethanol treatment enhanced the MDH1–CIT1 interaction signal, it also increased the abundance of MDH1 and CIT1, resulting in a reduced interaction index. Because ethanol induces protein accumulation under our experimental conditions, this result is not straightforward to interpret. We have included this observation and its interpretation in the revised manuscript (Lines 208–211).

      (4) The authors show that the MDH1-CIT1 interaction is sensitive to pH. Is the MDH1-CIT1 interaction affected by uncouplers in vivo?

      We thank the reviewer for suggesting a meaningful experiment. We performed a new experiment examining the effect of the uncoupler CCCP on MDH1–CIT1 interaction in vivo (Figure 4—figure supplement 4). We found that CCCP treatment increased the interaction signal, consistent with the idea that acidification of the mitochondrial matrix promotes MDH1–CIT1 association.

      However, CCCP treatment also caused an abnormal decrease in the luciferase signals from MDH1 and CIT1 fused to full-length NanoLUC, making the interaction index harder to interpret. Therefore, although these results support a possible role for pH in regulating the interaction, they should be viewed with caution and are presented only as a figure supplement. This experiment and its interpretation have been added to the revised manuscript (Lines 276–283).

      (5) NADH is a potent suppressor of many enzymes within the TCA cycle, including MDH1 and CIT1. Can the authors modulate mitochondrial NADH through genetic manipulation of Ndi1, or through overexpression of mito-Lb-NOX (PMID: 27124460)?

      We thank the reviewer for this insightful suggestion. We agree that mitochondrial NADH is a potential regulator of the MDH1-CIT1 interaction, as it is a potent suppressor of many TCA cycle enzymes; indeed, we have previously shown that NADH inhibits the MDH-CS interaction in vitro (Omini et al 2021 PMID: 34548590). For this reason, we investigated the mitochondrial matrix redox state, which is related to NADH levels, in the current study. The reviewer’s proposed strategy of using targeted genetic tools such as mito-Lb-NOX or Ndi1 manipulation to specifically influence the NADH level is an elegant approach to isolate this variable. However, implementing this system requires generating, optimizing, and validating new yeast strains that harbor the targeted NADH-modulating constructs alongside the NanoBiT and full-length NanoLUC sensor systems. Because this extensive strain engineering and subsequent live-cell validation fall outside a feasible timeframe for the current manuscript revision, we must respectfully defer these experiments. We view the precise manipulation of the mitochondrial redox state via tools such as mito-Lb-NOX as a complementary approach for our future work to systematically pinpoint the individual regulatory factors. We have expanded our Discussion (Lines 417-420; 462-465) to highlight the targeted genetic manipulation of possible regulatory factors, including the NADH pool, as a critical future direction for dissecting these dynamics.

      (6) The authors should correct their figures:

      (a) Axes should be easy to interpret on graphs.

      (b) Individual datapoints should be shown on bar graphs and box plots. Minimally, the number of samples evaluated should be reported.

      (c) Molecular weight markers should be reported on blots.

      We thank the reviewer for these helpful suggestions. Points (a) and (b) overlap with Weakness 3, which we have addressed through revisions to improve figure clarity and data presentation. Specifically, axis labels have been revised to be more intuitive, the number of samples is now reported in the figure captions, and bar and box plots have been updated to include individual data points. For time-course data, we retained point-line plots, as alternative formats (e.g., bar or box plots) would reduce clarity due to the density of time points.

      For point (c), we have added molecular weight markers to the immunoblot data where available (Figure 1C). In the time-course experiment in the original Figure 2, molecular weight markers were absent from the gel images. Although we are confident in the identity of the detected signals, we have moved these data to a figure supplement (Figure 2—figure supplement 1C) to reflect this limitation. Similarly, the corresponding Co-IP data are now presented as a figure supplement (Figure 2—figure supplement 1A).

      Minor points:

      (1) In the last paragraph before the results, the authors refer to "the fluorescent biosensors", but start the paragraph discussing the nanoBIT PPI. After reading the manuscript, these seem to be distinct experimental setups, but that was not evident in the first read through of the paper.

      We thank the reviewer for pointing out this source of confusion. We apologize for the lack of clarity in distinguishing between the experimental approaches. In this study, the NanoBiT system was used to measure MDH1–CIT1 interaction, whereas fluorescent biosensors were used to assess mitochondrial matrix pH, redox state, and ATP levels. We have revised the paragraph to more clearly distinguish these methodologies and their respective roles in the study (Lines 105–112).

      (2) As mentioned above, referring to multiple figures out of order within the manuscript is very jarring for the reader. The authors should consider reworking the narrative or figures to be presented in order.

      We thank the reviewer for this comment. This concern overlaps with the previous comment regarding figure organization, which we have addressed by revising both the figure labeling and the corresponding text. Specifically, figure subpanels have been reorganized to follow the measured parameters rather than treatment conditions, and the Results sections describing TCA cycle manipulation and ETC inhibition have been revised to follow the updated figure order (Lines 208–231; 251–274). These changes improve the logical flow and readability of the manuscript.

    1. eLife Assessment

      This manuscript presents a useful mean-field model for a network of Hodgkin-Huxley neurons retaining the equations for ion exchange between the intracellular and extracellular space. The mean-field model derived in this work relies on approximations and heuristic arguments that, on the one hand, allow a closed-form derivation of the mean-field equations, but also raise questions about their justifications and the degree to which the results agree with experiments as well as direct numerical simulations. While the revised manuscript is much improved, reviewers continue to question the methodology for reducing model dimensionality and therefore the evidence for the utility of this approach remains incomplete at present.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript the authors derive a mean-field model for a network of Hodgkin-Huxley neurons retaining the equations for ion exchange between the intracellular and extracellular space.

      The mean-field model derived in this work relies on approximations and heuristic arguments that, on the one hand, allow a closed-form derivation of the mean-field equations, and on the other hand restrict its validity to a limited regime of activity corresponding to quasi-synchronous neuronal populations. Therefore, rather than an exact mean-field representation, the model provides a description of a mesoscopic population of connected neurons driven by ion exchange dynamics.

      Strengths:

      The idea of deriving a mean-field model which relates the slow-timescale biophysical mechanism of ion exchange and transportation in the brain to the fast-timescale electrical activities of large neuronal ensembles.

      Weaknesses:

      The idea underlying this work is not completely implemented in practice.

      The derived mean-field model does not show a one-to-one correspondence with the neural network simulations, except in strongly synchronous regimes. The agreement with the in vitro experiment is hardly evident, both for the mean-field model and for the network model. The assumptions made to derive the closed-form equations of the mean-field model have not been justified by any biological reason; they just allow for the mathematical derivation. The final form of the mean-field equations does not clarify whether or not microscopic variables are used together with macroscopic variables in an inconsistent mixture.

      Comments on revisions:

      The main weaknesses I listed in the first report are still present, since the authors did not answer my questions on a solid basis. I report the list for completeness:

      (1) It seems that the reduction methodology that is employed is not the most suitable one for the single-neuron model they are considering.

      (2) The formulation of the mean-field derivation is unnecessarily complicated. It could be heavily simplified by following previously published approaches to derive biologically realistic neural masses.

      (3) The model seems to work only for highly synchronized situations and not for the standard asynchronous evolution usually observed in neural circuits.

      Therefore, my statement remains unchanged.

    3. Reviewer #2 (Public review):

      Summary:

      The authors aim to develop a neural mass model characterized by a few collective variables mimicking the dynamics of a network of Hodgkin–Huxley neurons encompassing ion-exchange mechanisms. They describe in detail the derivation of the mean-field model, then they compare experimental results obtained for the hippocampus of a mouse with the neural network simulations and the mean-field results. Furthermore, they report a bifurcation analysis of the developed model and simulation of a small network containing various coupled neural masses, somehow moving towards the simulation of an entire connectome.

      Strengths:

      The authors attempt to develop a mean-field model for a globally coupled network of heterogeneous Hodgkin-Huxley neurons with an explicit ion exchange mechanism between the cell interior and exterior.

      Weaknesses:

      (1) They do not employ the reduction methodology more suited for the single neuron model they consider.

      (2) Their derivation of the neural mass model is based on several assumptions, not all of which are well justified.

      (3) Their formulation of the mean-field derivation is unnecessarily complicated; it can be greatly simplified by following previously published approaches to derive biologically realistic neural masses.

      (4) Their model seems to work only for highly synchronized situations and not for the standard asynchronous evolution usually observed in neural circuits.

      General Statements:

      The authors honestly declared the many limitations of their approach. Given these limitations, the results of the mean-field model are, as expected, somewhat inconsistent with the neural network simulations.

      The authors suggest employing this model for simulations on the whole connectome to follow seizure propagation; however, I believe that a simpler model, such as the Epileptor, remains superior in this respect. The present model does include biophysical parameters, but their correspondence with the ones employed in the network dynamics remains elusive, owing to the many assumptions required to derive this mean-field model. Furthermore, because it is more complicated than the Epileptor, I do not think that the present model will be widely adopted by the community.

      Comments on revisions:

      The authors have corrected the mistakes present in the manuscript and provided a corrected list of references.

      However, they refuse

      (1) To simplify the formulation of the model. The model contains unnecessary complications, as I clearly wrote in my report; the authors agree, but they do not want to change the formulation;

      (2) To derive the mean-field model in a simpler way, which is possible and which I asked for many times in my referee report. This would help readers understand the important aspects of the derivation without unnecessary and confusing formulations;

      (3) To compare direct simulations of the network with neural mass results in sub-section "Bifurcation analysis: emergent network states and multistability" to show bistability, as I asked.

      As a matter of fact, the modifications made do not resolve my previous doubts about the validity of the results reported in the manuscript.

      Therefore, my previous assessments remain valid.

    4. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review)

      Summary:

      In this manuscript the authors derive a mean-field model for a network of Hodgkin-Huxley neurons retaining the equations for ion exchange between the intracellular and extracellular space.

      The mean-field model derived in this work relies on approximations and heuristic arguments that, on the one hand, allow a closed-form derivation of the mean-field equations, and on the other hand restrict its validity to a limited regime of activity corresponding to quasi-synchronous neuronal populations. Therefore, rather than an exact mean-field representation, the model provides a description of a mesoscopic population of connected neurons driven by ion exchange dynamics.

      Strengths:

      The idea of deriving a mean-field model which relates the slow-timescale biophysical mechanism of ion exchange and transportation in the brain to the fast-timescale electrical activities of large neuronal ensembles.

      Weaknesses:

      The idea underlying this work is not completely implemented in practice.

      The derived mean-field model does not show a one-to-one correspondence with the neural network simulations, except in strongly synchronous regimes. The agreement with the in vitro experiment is hardly evident, both for the mean-field model and for the network model. The assumptions made to derive the closed-form equations of the mean-field model have not been justified by any biological reason; they just allow for the mathematical derivation. The final form of the mean-field equations does not clarify whether or not microscopic variables are used together with macroscopic variables in an inconsistent mixture.

      Comments on revisions:

      The main weaknesses I listed in the first report are still present, since the authors did not answer my questions on a solid basis. I report the list for completeness:

      (1) It seems that the reduction methodology that is employed is not the most suitable one for the single-neuron model they are considering.

      (2) The formulation of the mean-field derivation is unnecessarily complicated. It could be heavily simplified by following previously published approaches to derive biologically realistic neural masses.

      (3) The model seems to work only for highly synchronized situations and not for the standard asynchronous evolution usually observed in neural circuits.

      Therefore, my statement remains unchanged.

      Reviewer #2 (Public review)

      Summary:

      The authors aim to develop a neural mass model characterized by a few collective variables mimicking the dynamics of a network of Hodgkin–Huxley neurons encompassing ion-exchange mechanisms. They describe in detail the derivation of the mean-field model, then they compare experimental results obtained for the hippocampus of a mouse with the neural network simulations and the mean-field results. Furthermore, they report a bifurcation analysis of the developed model and simulation of a small network containing various coupled neural masses, somehow moving towards the simulation of an entire connectome.

      Strengths:

      The authors attempt to develop a mean-field model for a globally coupled network of heterogeneous Hodgkin-Huxley neurons with an explicit ion exchange mechanism between the cell interior and exterior.

      Weaknesses:

      (1) They do not employ the reduction methodology more suited for the single neuron model they consider.

      (2) Their derivation of the neural mass model is based on several assumptions, not all of which are well justified.

      (3) Their formulation of the mean-field derivation is unnecessarily complicated; it can be greatly simplified by following previously published approaches to derive biologically realistic neural masses.

      (4) Their model seems to work only for highly synchronized situations and not for the standard asynchronous evolution usually observed in neural circuits.

      General Statements:

      The authors honestly declared the many limitations of their approach. Given these limitations, the results of the mean-field model are, as expected, somewhat inconsistent with the neural network simulations.

      The authors suggest employing this model for simulations on the whole connectome to follow seizure propagation; however, I believe that a simpler model, such as the Epileptor, remains superior in this respect. The present model does include biophysical parameters, but their correspondence with the ones employed in the network dynamics remains elusive, owing to the many assumptions required to derive this mean-field model. Furthermore, because it is more complicated than the Epileptor, I do not think that the present model will be widely adopted by the community.

      Comments on revisions:

      The authors have corrected the mistakes present in the manuscript and provided a corrected list of references.

      However, they refuse

      (1) To simplify the formulation of the model. The model contains unnecessary complications, as I clearly wrote in my report; the authors agree, but they do not want to change the formulation;

      (2) To derive the mean-field model in a simpler way, which is possible and which I asked for many times in my referee report. This would help readers understand the important aspects of the derivation without unnecessary and confusing formulations;

      (3) To compare direct simulations of the network with neural mass results in sub-section "Bifurcation analysis: emergent network states and multistability" to show bistability, as I asked.

      As a matter of fact, the modifications made do not resolve my previous doubts about the validity of the results reported in the manuscript.

      Therefore, my previous assessments remain valid.

      We thank the editors and the two reviewers for their continued engagement with our manuscript. The three weaknesses retained from the first round are essentially identical between the two public reviews:

      (i) The reduction methodology is not the most suitable for the single-neuron model we consider;

      (ii) The mean-field derivation is unnecessarily complicated;

      (iii) The model works only in highly synchronous regimes and does not reproduce the asynchronous evolution typical of neural circuits.

      Both reviewers explicitly note that their assessments remain unchanged and we have decided not to alter the formulation of the model. We use this response to state—on the public record—exactly where we agree with the reviewers, where we disagree, and why.

      On point (i): the reduction methodology.

      We fully agree with the reviewers' technical observation: the Ott–Antonsen / Lorentzian-ansatz reduction in the form introduced by Montbrió, Pazó and Roxin (2015) is exact for canonical Type I neurons (QIF), whose membrane-potential equation is quadratic, and is not directly applicable to a Type II / Hodgkin–Huxley-type neuron whose voltage dynamics is cubic-like. On this point there is no disagreement.
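
      For reference, the exact result referred to here can be summarized as follows, written in the notation of Montbrió, Pazó and Roxin (2015) rather than that of our manuscript. For QIF neurons obeying dV_j/dt = V_j² + η_j + J r + I(t), with time measured in units of the membrane time constant and excitabilities η_j drawn from a Lorentzian of centre η̄ and half-width Δ, the Lorentzian ansatz for the conditional voltage density closes exactly onto two equations for the firing rate r and the mean membrane potential v:

```latex
\begin{aligned}
\rho(V \mid \eta, t) &= \frac{1}{\pi}\,
  \frac{x(\eta,t)}{\left[V - y(\eta,t)\right]^{2} + x(\eta,t)^{2}},
  \qquad r(\eta,t) = \frac{x(\eta,t)}{\pi}, \quad v(\eta,t) = y(\eta,t), \\
\dot r &= \frac{\Delta}{\pi} + 2\,r\,v, \\
\dot v &= v^{2} + \bar\eta + J\,r + I(t) - \pi^{2} r^{2}.
\end{aligned}
```

      The closure relies on the quadratic dependence of dV/dt on V, which is why it does not carry over unchanged to a cubic-like voltage equation.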

      Where we differ is in the conclusion the reviewers draw from this observation. The reviewers read our work as applying an inappropriate reduction methodology to an inappropriate neuron model. We instead positioned our work, from the outset, as an extension of that methodology: we keep the biophysically detailed Hodgkin–Huxley substrate (because it is the only level at which extracellular ion concentrations, depolarization block, bursting and seizure-like events are biophysically grounded), and we adapt the reduction by approximating the cubic voltage nullcline as a piece-wise quadratic with two parabolas of opposite curvature. This is explicitly an approximate, not exact, mean-field. The Lorentzian ansatz is then applied on each branch of the piece-wise quadratic, with the limitations of this construction analyzed in the manuscript.
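
      As a purely illustrative sketch of what a piece-wise quadratic stand-in for a cubic looks like (this is not the construction used in the manuscript), the snippet below approximates the generic cubic f(V) = V - V³/3 by two parabolas of opposite curvature joined at the inflection point V = 0. The choice of cubic, the rule of matching each parabola to the cubic's local extremum, and all function names are assumptions made for the example.

```python
import numpy as np

def cubic(v):
    """Generic cubic f(V) = V - V^3/3 (an assumed example, not the model's nullcline)."""
    v = np.asarray(v, dtype=float)
    return v - v**3 / 3.0

def piecewise_quadratic(v):
    """Two parabolas of opposite curvature, joined at the inflection point V = 0.

    Right branch: concave, vertex at the cubic's local maximum (1, 2/3).
    Left branch:  convex,  vertex at the cubic's local minimum (-1, -2/3).
    Both branches pass through (0, 0) with the same slope, so the join is smooth.
    """
    v = np.asarray(v, dtype=float)
    right = 2.0 / 3.0 - (2.0 / 3.0) * (v - 1.0) ** 2
    left = -2.0 / 3.0 + (2.0 / 3.0) * (v + 1.0) ** 2
    return np.where(v >= 0.0, right, left)

# Largest pointwise error on a grid covering both extrema.
v_grid = np.linspace(-1.5, 1.5, 301)
max_error = float(np.max(np.abs(cubic(v_grid) - piecewise_quadratic(v_grid))))
print(f"max |cubic - piecewise quadratic| on [-1.5, 1.5]: {max_error:.3f}")
```

      In this toy case the two branches agree with the cubic exactly at the extrema and at the join, and the mismatch grows away from those points, which is the sense in which such a construction is approximate rather than exact.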

      The reviewers' alternative—starting from a Type I canonical model and grafting on biophysical features—would indeed yield an exact mean-field, but it would forfeit precisely what motivates our work: a tractable mesoscopic description in which the slow variables are physiologically interpretable ion concentrations rather than phenomenological parameters. The trade-off is that we give up exact rigour in order to construct a bridge between the Montbrió-style next-generation neural mass models on one side and the Epileptor on the other, with the additional benefit that the parameters of the resulting neural mass retain a biophysical correspondence (e.g., [K<sup>+</sup>]_bath, Δ[K<sup>+</sup>]_int, [K<sup>+</sup>]_g, the gating variable n) that the Epileptor does not afford.

      We therefore respectfully maintain our position: the methodology is not "the wrong reduction for a Type II neuron"; it is an extended reduction designed to be applicable beyond the Type I case, with explicitly characterized validity.

      On point (ii): the formulation is unnecessarily complicated.

      We agree with the reviewers that, given the assumptions we ultimately adopt, namely that the gating variable n and the potassium concentrations Δ[K<sup>+</sup>]_int and [K<sup>+</sup>]_g are treated as collective (mesoscopic) variables shared by the population, with n a function of the average membrane potential, the closed neural mass equations could be reached by the more direct path used by Guerreiro et al. (2022) and the related literature (R1–R7). In the revised manuscript we now state this explicitly, and we note that the same five-dimensional system arises under either derivation.

      Our choice to follow Chen and Campbell (2022) is motivated by the fact that it makes each approximation visible at the point where it is invoked. In particular, it exposes the moment-closure step (Eq. 19), the vanishing-flux boundary condition (Eq. 28), and the locations where microscopic and mesoscopic variables enter the description. We believe that for a reader trying to extend our framework, for instance to a setting with partial heterogeneity in the slow variables, or with stochastic gating, this is the more useful presentation. We have added a remark stating that the simpler Guerreiro-type derivation reaches the same equations under our assumptions, so that readers can take whichever route they find clearer.

      On point (iii): the model only works in highly synchronous regimes.

      Here we partially agree and partially disagree, and we would like the partial disagreement to appear on the public record.

      We agree that the Lorentzian ansatz is, strictly, valid in regimes where the population's membrane potential distribution is unimodal, that is, when essentially all neurons sit on the same side of the threshold V*. Where we disagree is with the implication that the mean-field model fails outside the strongly synchronous regime. The supplementary analysis in Fig. S2, added in the previous round, quantifies the error introduced by the first-moment approximation of n as a collective variable across the full range of [K<sup>+</sup>]_bath values, spanning quiescent, bursting, seizure-like, sustained ictal and depolarization-block dynamics. The fraction of neurons whose gating variable deviates from the population mean is below 2% for the parameters used throughout the manuscript, and the error becomes appreciable only during the brief transitions between sub- and supra-threshold states. These are precisely the moments at which the population is genuinely bimodal and the single-Lorentzian assumption is theoretically expected to leak. In other words, the error peaks coincide with the moments where our derivation tells us in advance that the assumption is locally invalid; the model "knows where it fails." Away from these transitions, the mean-field tracks the population average across all dynamical regimes shown in Fig. 3, not only in the most strongly synchronized ones.
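
      One way a deviation measure of this kind could be computed is sketched below. The array layout, the absolute tolerance `tol` and the function name are assumptions made for illustration; the definition used for Fig. S2 may differ.

```python
import numpy as np

def deviating_fraction(n, tol=0.05):
    """Fraction of neurons, per time step, whose gating variable differs
    from the population mean by more than `tol`.

    n   : array of shape (time_steps, neurons) holding n_i(t)
    tol : hypothetical absolute tolerance (an assumption, not taken from the manuscript)
    """
    n = np.asarray(n, dtype=float)
    population_mean = n.mean(axis=1, keepdims=True)
    return (np.abs(n - population_mean) > tol).mean(axis=1)

# Synthetic data: 100 neurons, 3 time steps.
n = np.full((3, 100), 0.3)
n[1, :1] = 0.8    # step 1: a single outlier, population essentially unimodal
n[2, :50] = 0.8   # step 2: population split in two, i.e. bimodal

fractions = deviating_fraction(n)
print(fractions)
```

      In the synthetic bimodal step every neuron sits far from the mean, so the measure saturates; this mirrors the expectation that a mean-based collective variable is least informative during transitions.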

      This is, in our view, the strongest argument we can make: we are not claiming exactness, and we are not unaware of the limitations. We have characterized them analytically (the construction of the piece-wise Lorentzian, and the theoretical reason a closed solution exists only when the two branches collapse onto one), and we have characterized them numerically (Fig. S2). The deviations are bounded, their location in parameter space is well identified, and they coincide with transitions where the underlying assumption is locally violated. We believe this constitutes a controlled approximation rather than an uncontrolled one, and we would like this distinction to be visible to readers of the Reviewed Preprint.

      We note, in this connection, that the reviewers' preferred reference point, the next-generation neural mass model of Montbrió et al. (2015), which is exact and one-to-one with its underlying network, is exact precisely because the underlying network is a network of QIF neurons. The corresponding statement for a network of Hodgkin–Huxley-type neurons with explicit ion exchange does not, to our knowledge, exist in closed form, and may not exist at all. The relevant question is therefore not whether our model matches the exactness of the QIF case, but whether the controlled approximation we provide is useful. Given the qualitative agreement with neural-network simulations across the full range of [K<sup>+</sup>]_bath, the qualitative agreement with the in vitro recordings, and the recovery of the expected bifurcation structure with new emergent regimes, we believe the answer is yes.

      Other outstanding points in the review.

      Reviewer 2 reiterates the view that the Epileptor remains superior for whole-connectome seizure-propagation simulations because it is simpler and better characterized. We do not dispute that the Epileptor is more thoroughly analyzed and more parsimonious. The complementarity we propose is not a replacement but a parameter-grounding, as the Epileptor's phenomenological parameters (excitability, slow permittivity) acquire, in the present framework, an interpretation in terms of measurable biophysical quantities (extracellular potassium, intracellular potassium variation, glial buffering).

      We thank the reviewers and editors once again for their careful reading, and we are grateful that the points of disagreement have been sharpened to a state where readers can judge them transparently.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors derive a mean-field model for a network of Hodgkin-Huxley neurons retaining the equations for ion exchange between the intracellular and extracellular space.

      The mean-field model derived in this work relies on approximations and heuristic arguments that, on the one hand, allow a closed-form derivation of the mean-field equations, and on the other hand restrict its validity to a limited regime of activity corresponding to quasi-synchronous neuronal populations. Therefore, rather than an exact mean-field representation, the model provides a description of a mesoscopic population of connected neurons driven by ion exchange dynamics.

      We agree with the reviewer's characterization. Our manuscript describes the derivation as relying on "approximations and heuristic arguments" and states that "the derivation is not exact"; what we provide is a controlled, approximate mesoscopic description in which the slow variables are physiologically interpretable ion concentrations rather than phenomenological parameters. An exact closed-form thermodynamic limit is, to our knowledge, available only for canonical Type I (QIF) networks (Montbrió, Pazó and Roxin, 2015) and a few of their extensions; it is not currently known for a Hodgkin–Huxley-type network with explicit ion-exchange dynamics. We acknowledge that the original description of the regime of validity may have caused confusion on this point, and in the revised manuscript we have therefore replaced the looser formulation "strongly synchronous regimes" by the more accurate "regimes where the membrane-potential distribution is unimodal and can be reasonably approximated by a Lorentzian" throughout the manuscript.

      Strengths:

      The idea of deriving a mean-field model that relates the slow-timescale biophysical mechanism of ion exchange and transportation in the brain to the fast-timescale electrical activities of large neuronal ensembles.

      We thank the reviewer for recognizing the motivation behind our work. This explicit coupling between slow biophysical ion dynamics and fast electrical activity is precisely the feature we tried to preserve in the reduction, even at the cost of giving up exactness.

      Weaknesses:

      The idea underlying this work is not completely implemented in practice.

      We address this general statement through the four specific sub-points the reviewer raises in the paragraph that follows.

      The derived mean field model does not show a one-to-one correspondence with the neural network simulations, except in strongly synchronous regimes.

      We partially agree and partially disagree. We agree that the Lorentzian ansatz is strictly valid where the membrane-potential distribution is unimodal, i.e. when essentially all neurons sit on the same side of the threshold V*. We disagree with the implication that the mean-field fails outside this regime. To make this claim quantitative, we added a new supplementary figure (Fig. S2) that quantifies the deviation of individual neurons' gating variables from the population mean across the full range of [K<sup>+</sup>]_bath values—quiescent, bursting, seizure-like, sustained ictal and depolarization-block dynamics. The fraction of deviating neurons is below 2% for the parameters used in the manuscript, with localized peaks only during the brief, genuinely bimodal transitions between sub- and supra-threshold states—precisely the moments at which the theory predicts the assumption to be locally invalid. Away from these transitions, the mean-field tracks the population average across all dynamical regimes shown in Fig. 3, not only in the strongly synchronized ones.

      The agreement with the in vitro experiment is hardly evident, both for the mean-field model and for the network model.

      We acknowledge that the experimental and simulated traces in the original Fig. 4 did not match quantitatively; a quantitative match was never our intention. The figure and its caption have been reorganized in the revised manuscript to frame the comparison as qualitative: we aim to demonstrate the shared structure, i.e., the slow modulation of fast population activity by extracellular potassium fluctuations, rather than to claim a quantitative fit.

      We also added two clarifications that account for the residual differences: (i) the network simulations were intentionally run with rescaled biophysical parameters (membrane capacitance, gating time constants) to keep the computational cost feasible, a standard practice when the goal is to validate dynamical mechanisms rather than absolute timescales; (ii) the in vitro LFP recordings were AC-coupled, so the slow DC components visible in the mean-field traces are filtered out at acquisition.
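
      The effect described in point (ii) can be illustrated with a first-order high-pass filter acting on a synthetic trace. This is a sketch only: the cutoff frequency, sampling rate and signal are assumptions chosen for the example and are not the acquisition settings of the recordings.

```python
import numpy as np

def high_pass(x, fs, fc):
    """First-order (RC) high-pass filter, a simple stand-in for AC coupling.

    x  : input samples
    fs : sampling rate in Hz
    fc : cutoff frequency in Hz (hypothetical value, chosen for illustration)
    """
    x = np.asarray(x, dtype=float)
    dt = 1.0 / fs
    rc = 1.0 / (2.0 * np.pi * fc)
    a = rc / (rc + dt)
    y = np.zeros_like(x)
    for i in range(1, len(x)):
        y[i] = a * (y[i - 1] + x[i] - x[i - 1])
    return y

fs = 1000.0                                  # assumed sampling rate (Hz)
t = np.arange(0.0, 10.0, 1.0 / fs)
dc_offset = 2.0                              # slow (DC) component
fast = 0.5 * np.sin(2.0 * np.pi * 20.0 * t)  # fast 20 Hz activity
raw = dc_offset + fast
filtered = high_pass(raw, fs=fs, fc=1.0)

half = len(t) // 2                           # skip the initial transient
raw_mean = float(raw[half:].mean())
filtered_mean = float(filtered[half:].mean())
filtered_std = float(filtered[half:].std())
print(raw_mean, filtered_mean, filtered_std)
```

      The DC offset is removed while the fast oscillation passes almost unchanged, so a model trace that retains slow components and an AC-coupled recording can differ in baseline without differing in the fast dynamics.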

      The assumptions made to derive the closed-form equations of the mean-field model have not been justified by any biological reason, they just allow for the mathematical derivation.

      We agree that the modelling assumptions were scattered through the original derivation. In the revised manuscript, the three core assumptions are stated explicitly at the point of derivation: (i) the gating variable n is treated as a collective, population-averaged variable; (ii) the potassium concentrations Δ[K<sup>+</sup>]_int and [K<sup>+</sup>]_g are homogeneous across the population, biophysically justified by the rapid redistribution of ions through diffusion and electrochemical gradients, which enforces near-instantaneous equilibration at the mesoscopic scale; (iii) no heterogeneity is assumed at the level of ion dynamics. The meaning of "locally homogeneous" is now defined explicitly.

      On the biophysical motivation of the in vitro perturbation used in the experiment, we have added a new Methods subsection that explains how low extracellular Mg<sup>2+</sup> unblocks NMDARs and abolishes the divalent-cation stabilisation of the resting membrane potential, depolarising hippocampal neurons and increasing the driving force for outward K<sup>+</sup> currents. This provides a biophysical link between the experimental perturbation and the model's main control parameter, the extracellular potassium concentration. We also added a reference to the well-established model of epileptic discharges that underpins the experiment.

      The final form of the mean-field equations does not clarify whether or not microscopic variables are used together with macroscopic variables in an inconsistent mixture.

      We now explicitly acknowledge that in the spiking-network simulations the gating variable n is microscopic (each neuron has its own n_i), whereas in the mean-field derivation it is treated as mesoscopic and shared by the population. This asymmetry between modalities is discussed both in the Results and in the Limitations sections, and is identified as a likely source of some of the discrepancy between the two modalities.

      We have also made the notation in Eqs. (36)–(37) consistent (firing rate r used throughout, full current-based dV/dt̄ restored) and fixed the typos and broken equation/reference labels that contributed to the impression of inconsistency (Eqs. 18, 28, 29; the Fig. 2(c) [K<sup>+</sup>] bath label; the lost reference at line 696).

      Reviewer #2 (Public review):

      Summary:

      The authors aim to develop a neural mass model characterized by a few collective variables mimicking the dynamics of a network of Hodgkin–Huxley neurons encompassing ion-exchange mechanisms. They describe in detail the derivation of the mean-field model, then they compare experimental results obtained for the hippocampus of a mouse with the neural network simulations and the mean-field results. Furthermore, they report a bifurcation analysis of the developed model and simulation of a small network containing various coupled neural masses, somehow moving towards the simulation of an entire connectome.

      We thank the reviewer for the accurate summary of the manuscript's structure and aims.

      Strengths:

      The author attempts to develop a mean-field model for a globally coupled network of heterogeneous Hodgkin-Huxley neurons with an explicit ion exchange mechanism between the cell interior and exterior.

      We thank the reviewer for recognizing this objective. The retention of Hodgkin–Huxley dynamics with explicit ion exchange is precisely the feature that distinguishes our framework from QIF-based reductions, and it is what enables the slow variables of the resulting mean-field to retain a direct biophysical interpretation.

      Weaknesses:

      (1) It seems that the reduction methodology that is employed is not the most suitable one for the single-neuron model they are considering.

      We agree, on technical grounds, with the observation: the Ott–Antonsen / Lorentzian-ansatz reduction is exact for canonical Type I neurons (QIF) and is not directly applicable to a Type II Hodgkin–Huxley-type neuron with a cubic-like voltage nullcline. Where we differ is in the conclusion. We did not apply an inappropriate reduction to an inappropriate neuron; we deliberately extended the methodology by approximating the cubic nullcline as a piece-wise quadratic with two parabolas of opposite curvature, and then applying the Lorentzian ansatz on each branch. The result is an explicitly approximate, biophysically grounded mean-field, with its regime of validity stated and quantified (Fig. S2).

      To make this positioning explicit, we have added a paragraph to the Introduction that situates our work within the next-generation neural mass literature (Byrne et al. 2020; Montbrió, Pazó & Roxin 2015; Guerreiro et al. 2022; Forrester et al. 2024; Perl et al. 2023; Gerster et al. 2021; and works on short-term plasticity, adaptation, conductance-based reductions, spike-timing-dependent plasticity, random connectivity and noise) and clarifies that we see our contribution as complementary to these approaches, not as a competitor to the exact QIF reductions.

      (2) The authors' derivation of the neural mass model is based on several assumptions, and not all well justified.

      We agree that, in the original submission, the modelling assumptions were scattered through the derivation. In the revised manuscript, the three core assumptions are stated explicitly at the point of derivation: (i) the gating variable n is treated as a collective population-averaged variable; (ii) the potassium concentrations Δ[K<sup>+</sup>]_int and [K<sup>+</sup>]_g are homogeneous across the population, biophysically justified by the rapid redistribution of ions through diffusion and electrochemical gradients, which enforces near-instantaneous equilibration at the mesoscopic scale; (iii) no heterogeneity at the level of ion dynamics is assumed. The meaning of "locally homogeneous" is now defined explicitly. In addition, we have added Fig. S2, which quantifies numerically the error introduced by the moment-closure assumption (deviation below 2% for the parameters used in the manuscript).

      (3) The formulation of the mean-field derivation is unnecessarily complicated. It could be heavily simplified by following previously published approaches to derive biologically realistic neural masses.

      We agree that, under the assumptions ultimately adopted in our model—namely that n, Δ[K<sup>+</sup>]_int and [K<sup>+</sup>]_g are mesoscopic—the final five-dimensional system can be reached by the more direct path used by Guerreiro et al. (2022) and the related literature. We now state this explicitly in the revised manuscript and note that the same system arises under either derivation, so that the reader can take whichever route they find clearer. Our choice to retain the Chen and Campbell (2022) formalism is pedagogical: it exposes the moment-closure step (Eq. 19), the vanishing-flux boundary condition (Eq. 28), and the locations where microscopic versus mesoscopic variables enter the description, which is the more useful presentation for a reader wishing to extend the framework (e.g. to partial heterogeneity in the slow variables or to stochastic gating). We also made the notation in Eqs. (36)–(37) consistent (firing rate r used throughout, full current-based dV/dt̄ restored) and fixed a number of typos and broken equation/reference labels.

      (4) The model seems to work only for highly synchronized situations and not for the standard asynchronous evolution usually observed in neural circuits.

      We partially agree and partially disagree. We agree that the Lorentzian ansatz is strictly valid where the membrane-potential distribution is unimodal; we have replaced "strongly synchronous regimes" by this more accurate formulation throughout the manuscript. We disagree, however, with the implication that the mean-field is useful only in those regimes. Fig. S2, added in this revision, explicitly quantifies the deviation across all dynamical regimes (quiescent, bursting, seizure-like, sustained ictal and depolarization-block dynamics): it remains below 2% for the parameters used in the manuscript, with localized peaks only during the brief sub-to-supra-threshold transitions where the population is genuinely bimodal. Away from these transitions, the mean-field tracks the population average across all dynamical regimes shown in Fig. 3.

      General Statements:

      The authors honestly declared the many limitations of their approach. It is assumed that the results of the mean-field are somehow inconsistent with the neural network simulations as expected.

      We thank the reviewer for acknowledging that the limitations are honestly declared. As detailed above and quantified in Fig. S2, the deviation from the network simulations is bounded and well characterized; it is not assumed but measured.

      The authors suggest employing this model for the simulations on the whole connectome to follow seizure propagation, however, I believe that the Epileptor remains superior in this respect to this model. That indeed includes biophysical parameters but their correspondence with the ones employed in the network dynamics remains elusive, due to the many assumptions required to derive this mean-field model. Furthermore, it is more complicated than the Epileptor, I do not think that the present model will be largely employed by the community.

      We do not propose our model as a direct replacement for the Epileptor and we do not dispute that the Epileptor is more thoroughly analyzed and more parsimonious. The complementarity we propose is not a replacement but a parameter-grounding: the Epileptor's phenomenological parameters (excitability, slow permittivity) acquire, in our framework, a concrete interpretation in terms of measurable biophysical variables (extracellular potassium, intracellular potassium variation, glial buffering). Retaining the Hodgkin–Huxley substrate is essential to ground these variables biophysically.

      To make this complementarity more visible, the Limitations and Discussion section has been expanded to discuss the choice of a purely excitatory network as a first step (with excitatory–inhibitory generalizations available via the synaptic reversal potential) and to point to additional biological ingredients (calcium and other ions, plastic synapses, random connectivity and noise, adaptation, spike-timing-dependent plasticity) that the framework can accommodate, with reference to the next-generation neural mass literature.

      We thank the reviewers and editors for their careful reading. We hope this public response makes our reasoning, the limits of our approach, and the concrete revisions made in this round transparent.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In general, the writing is scattered. Every time a model is introduced, one starts from the general formulation only to find that a very simplified case is used with respect to that formulation, which is very confusing. Authors need to reduce unnecessary formulations that confuse the reader and make it clear which formulations are actually used.

      We thank the reviewer for this comment and understand the concern regarding the balance between general formulations and specific approximations. Our intention in including the more general equations and derivations (e.g., Eq. 7 and others) was pedagogical — to ensure completeness and transparency in the modeling steps, especially for readers less familiar with mean-field reductions of biophysically detailed models. These general forms also serve to clarify the assumptions underlying the simplifications we employ. In the latest version, we improved the clarity of core equations (e.g., Eq. 37), which form the basis of all simulations presented (see details below, in the answer to question 14).

      (2) The Introduction would benefit from a wider view of the literature. The literature on exact mean field models (i.e. derived from the Lorentzian Ansatz) has flourished in the last years. In particular, it would be worth considering the following papers, where exact neural mass models are applied to perform whole-brain and large-scale brain simulations:

      Forrester, M., Petros, S., Cattell, O., Lai, Y. M., O'Dea, R. D., Sotiropoulos, S., & Coombes, S. (2024). Whole brain functional connectivity: Insights from next generation neural mass modelling incorporating electrical synapses. PLOS Computational Biology, 20(12), e1012647.

      Perl, Y. S., Zamora-Lopez, G., Montbrio, E., Monge-Asensio, M., Vohryzek, J., Fittipaldi, S., Campo, C. G., Moguilner, S., Ibanez, A., Tagliazucchi, E., Yeo, B. T. T., Kringelbach, M. L., & Deco, G. (2023). The impact of regional heterogeneity in whole-brain dynamics in the presence of oscillations. Network Neuroscience, 7(2), 632-660.

      Byrne, Aine, James Ross, Rachel Nicks, and Stephen Coombes. "Mean-field models for EEG/MEG: from oscillations to waves." Brain topography 35, no. 1 (2022): 36-53.

      Gerster, M., Taher, H., Skoch, A., Hlinka, J., Guye, M., Bartolomei, F.,... & Olmi, S. (2021). Patient-specific network connectivity combined with a next generation neural mass model to test clinical hypothesis of seizure propagation. Frontiers in Systems Neuroscience, 15, 675272.

      Byrne, Aine, Reuben D. O'Dea, Michael Forrester, James Ross, and Stephen Coombes. "Next-generation neural mass and field modeling." Journal of neurophysiology 123, no. 2 (2020): 726-742.

      Benitez-Stulz, Sophie, Samy Castro, Gregory Dumont, Boris Gutkin, and Demian Battaglia. "Compensating functional connectivity changes due to structural connectivity damage via modifications of local dynamics." bioRxiv (2024): 2024-05.

      We have added the following paragraph:

      “Recently, a class of these models, called next-generation neural mass models [42], has been developed based on an analytical approach introduced by [25] that allowed for the exact derivation of mean field parameters for a population of quadratic integrate-and-fire (QIF) neurons. These can be linked to EEG/MEG oscillations [43], including epileptic seizures [43], and have been used to study various aspects of the whole-brain dynamics such as the low-dimensional manifold of the resting state [45,46], aging [47] and neural signatures of consciousness [48].”

      We have also modified the preceding paragraph of the introduction that now reads:

      “At the mesoscopic level, the observable properties of a neuronal ensemble are generally explained by statistical physics formalism of mean-field theory [19-22]. Mean-field models demonstrated a predictive value for studying the mesoscopic dynamics of neuronal populations [23], providing statistical descriptions of neuronal networks [2, 19, 24-29], which can be used to address questions related to network-level mechanisms [12, 24, 30].

      In general, neural mass models have a low enough number of parameters to be tractable and provide general intuitions regarding mechanisms underlying complex neuronal activity [31-36]. For example, statistical population measures, such as the firing rate, can be used to assess mesoscopic dynamics [1, 7, 31, 36-41].”

      (3) Moreover, conductance-based models have been already implemented in neural mass models not only in references [69, 71, 95], but also in:

      Guerreiro, I. C., Di Volo, M., & Gutkin, B. (2023). A new generation of reduction methods for networks of neurons with complex dynamic phenotypes.

      Capone, C., Di Volo, M., Romagnoni, A., Mattia, M., & Destexhe, A. (2019). State-dependent mean-field formalism to model different activity states in conductance-based networks of spiking neurons. Physical Review E, 100(6), 062413.

      We have added the following sentence:

      “Moreover, conductance-based couplings between the spiking neurons have been already implemented in neural mass models [58, 59, 91, 93, 121], but without an extracellular exchange mechanism.”

      (4) Sec. 1.1 As previously established in the literature, a system of all-to-all coupled neuronal equations can be solved exactly in the thermodynamic limit (i.e., infinite neurons limit) if the single neuron membrane potential equation is a quadratic function and if the instantaneous distribution of membrane potentials of neurons in a population is described by a Lorentzian [Montbrió, E., Pazó, D. & Roxin, A. Physical Review X 5 (2), 021028 (2015)]. This means that the thermodynamic limit can be performed for a Canonical Type I model like the quadratic integrate-and-fire.

      What is the biological justification and the reason to approximate a different neuron type (a type II neuron model), whose membrane potential equation resembles a cubic function, with a quadratic function? The fact that it can be solved in the quadratic approximation is not, in my opinion, a sufficient justification. It would be more correct to start from a type I neuron at the microscopic level with a quadratic function and then provide additional biological features.

      We thank the reviewer for raising this important point. We respectfully disagree with the notion that starting from a canonical Type I model (such as the quadratic integrate-and-fire neuron) would be a more biologically grounded approach. While the quadratic form is analytically convenient, it does not capture certain key features of neuronal excitability, particularly those related to bursting, seizure-like events, and depolarization block, which are closely tied to the cubic-like nullcline geometry arising in Hodgkin–Huxley-type models, especially in the presence of slow ion dynamics.

      Our work seeks to bridge biophysical realism with analytical tractability. The step-wise quadratic approximation we employ is specifically designed to mimic the cubic membrane potential profile that emerges from the full ion-exchange dynamics. While the Lorentzian Ansatz is not strictly justified in this case from first principles, we show that it yields a workable and biologically interpretable mean-field description, which aligns with single-neuron dynamics, population simulations, and even in vitro observations. To our knowledge, this is a novel contribution that extends mean-field modeling beyond currently available approaches, which are often restricted to simplified or phenomenological neuron models.

      In this context, using a quadratic approximation is not merely a mathematical convenience — it is a means to retain key dynamical features of more realistic (non-Type I) neurons within a tractable framework, enabling insights into complex behaviors like multistability and pathological bursting.
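      The idea of replacing a cubic-like nullcline by two parabolas of opposite curvature can be illustrated with a minimal sketch. This is not the authors' code: the cubic `v - v**3/3`, the split point at the inflection, the voltage range and the unconstrained least-squares fit are all assumptions chosen for illustration, and the manuscript's actual parabola coefficients are not reproduced here.

```python
import numpy as np

# Hypothetical cubic-like nullcline, used only for illustration.
def cubic(v):
    return v - v**3 / 3.0

def fit_two_parabolas(f, v_min=-2.0, v_split=0.0, v_max=2.0, n=201):
    """Least-squares quadratic fit of f on each side of v_split."""
    v_left = np.linspace(v_min, v_split, n)
    v_right = np.linspace(v_split, v_max, n)
    left = np.polyfit(v_left, f(v_left), 2)     # coefficients, highest power first
    right = np.polyfit(v_right, f(v_right), 2)
    return left, right

def piecewise_quadratic(v, left, right, v_split=0.0):
    """Evaluate the left parabola below v_split and the right one above it."""
    v = np.asarray(v, dtype=float)
    return np.where(v < v_split, np.polyval(left, v), np.polyval(right, v))

left_coeffs, right_coeffs = fit_two_parabolas(cubic)
v_grid = np.linspace(-2.0, 2.0, 401)
approx = piecewise_quadratic(v_grid, left_coeffs, right_coeffs)
max_error = np.max(np.abs(approx - cubic(v_grid)))

print("left curvature:", left_coeffs[0])
print("right curvature:", right_coeffs[0])
print("max abs error on grid:", max_error)
```

      The two fitted leading coefficients have opposite signs, which is the property the piecewise approximation relies on. Note that this unconstrained fit does not enforce continuity at the split point; a matched construction would need an additional constraint there.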

      (5) Sec. 1.2 As shown in Figure 3, the mean-field equations do not show a one-to-one correspondence with the neural network simulations, except in strongly synchronous regimes. This represents a strong limitation in the model, especially because exact neural mass models (as shown in Reference [23]) perfectly fit the dynamics of the underlying network model both in the asynchronous and in the synchronized regime.

      We appreciate the reviewer’s observation and acknowledge that our original description may have caused confusion. The model's validity is not strictly limited to strongly synchronous regimes, but rather to regimes where the distribution of membrane potentials across the neuronal population remains unimodal and can be reasonably approximated by a Lorentzian. This includes, but is not restricted to, highly synchronized states.

      We agree that this distinction is important and have clarified it in the revised manuscript (e.g., “in strongly synchronous regimes” —> “in regimes where the membrane potentials' distribution is unimodal and can be reasonably approximated by a Lorentzian”).

      In contrast to exact mean-field reductions based on quadratic integrate-and-fire neurons (e.g., [23]), our model originates from a biophysically grounded HH-type neuron with ion exchange dynamics, and necessarily involves heuristic approximations to achieve a closed-form mean-field description. While this results in a less exact correspondence with network simulations in more heterogeneous or bimodal states, our goal was to retain biological interpretability and account for phenomena such as ion-driven bursting and seizure-like transitions, which are not captured by standard QIF-based neural masses.

      We see our contribution as complementary to existing exact reductions — offering a biophysically grounded alternative that remains tractable and informative in a relevant class of unimodal, mesoscopic dynamical regimes.
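      For readers unfamiliar with the exact reduction referred to in this exchange, the two-variable QIF mean-field of Montbrió, Pazó & Roxin (2015) can be sketched as follows. The equations are the published ones with the membrane time constant set to 1; the parameter values, initial condition and forward-Euler scheme are illustrative assumptions and are unrelated to the five-dimensional model discussed in the manuscript.

```python
import numpy as np

def qif_mean_field(r0, v0, eta_bar, delta, J, dt=1e-3, t_end=20.0):
    """Forward-Euler integration of the exact QIF mean-field (tau = 1).

    dr/dt = delta / pi + 2 r v
    dv/dt = v^2 + eta_bar + J r - (pi r)^2
    """
    r, v = r0, v0
    for _ in range(int(t_end / dt)):
        dr = delta / np.pi + 2.0 * r * v
        dv = v**2 + eta_bar + J * r - (np.pi * r) ** 2
        r, v = r + dt * dr, v + dt * dv
    return r, v

# Illustrative parameters: uncoupled population (J = 0) with inhibitory mean drive.
delta = 1.0
r_final, v_final = qif_mean_field(r0=0.1, v0=-2.0, eta_bar=-5.0, delta=delta, J=0.0)

# At a fixed point the firing-rate equation vanishes: delta / pi + 2 r v = 0.
residual = delta / np.pi + 2.0 * r_final * v_final
print("r =", r_final, "v =", v_final, "residual =", residual)
```

      With these values the trajectory settles on a low-activity fixed point, so the residual of the firing-rate equation is numerically zero at the end of the run.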

      (6) Sec. 1.3 In this section the authors show the comparison between in vitro experiments and simulations with both the network model and the neural mass model (Figure 4, panels a,b,c). The qualitative agreement that is supposed to be shown is hardly evident. The shape of the signals is different, as is the type of bursting. The only agreement results in the fact that there are repeated spiking events at successive times in a periodic manner. However, the time scale of the simulations is different for neural network simulation and mean-field experiment, making it difficult to compare them. While the period of the bursting event is around 2 min for mean field simulation (in accordance with experiments), the time scale of the network simulation is 60 times smaller, thus meaning that we are considering completely different mechanisms and phenomena. The justification given by the authors, that "the parameters were modified to simulate shorter fluctuations (in the network of Hodgkin-Huxley neurons) for computational efficiency" is inappropriate.

      The poor agreement turns out to be even worse in the comparison between experiments and mean-field simulations shown in panels d and e of Figure 4. While the mean field simulation is characterized by a periodic behaviour both in the mean membrane potential and in the external potassium concentration, the in-vitro traces are not periodic and show an increasing irregular activity of the extracellular LFP in correspondence with increasing external potassium concentration.

      How is it possible to justify the implementation of this model if the working hypotheses are not supported by the results? The worst agreement of the network simulations with the experiments reinforces the doubt raised in the previous point: what is the reasoning underlying the choice of Hodgkin-Huxley as a single neuron model?

      We thank the reviewer for this detailed critique. We acknowledge that the comparisons in Figure 4 involve limitations and we now provide a clearer rationale and context in the revised manuscript. First, we emphasize that our intention is not to claim a quantitative match between the experimental and simulated traces, but rather to demonstrate that our model, grounded in biophysical mechanisms such as ion exchange, is capable of qualitatively reproducing a key feature observed experimentally: the slow modulation of neuronal activity by extracellular potassium concentration. For example, both in vitro (Fig. 4a, 4d) and in our simulations (Fig. 4b, 4e), bursts of activity ride on slower oscillations of potassium, and the interplay of fast and slow dynamics is central to both.

      Regarding the discrepancy in timescales between the neural network and mean-field simulations: the network simulations were intentionally run with accelerated dynamics by rescaling biophysical parameters (e.g., membrane capacitance and gating time constants) to keep the computational cost feasible. We now clarify in the manuscript that this choice is standard practice in computational modeling when the primary goal is to validate dynamical mechanisms rather than replicate absolute timescales.

      On the shape of LFP signals: the experimental recordings were AC-coupled, so the DC components associated with slower shifts in membrane potential, such as those modeled in the mean-field simulations, are not captured in those recordings. This limits the visibility of key features like the underlying potential jumps. Additionally, no claim is made regarding a specific bursting classification in either data or simulation.
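      The effect of AC coupling on a sustained potential jump can be illustrated with a first-order high-pass filter. This is a generic sketch, not a model of the actual acquisition chain: the 0.5 Hz cutoff, the step amplitude and the 20 Hz fast component are hypothetical values chosen only to make the effect visible.

```python
import numpy as np

def rc_high_pass(x, dt, cutoff_hz):
    """Discrete first-order RC high-pass filter (a simple stand-in for AC coupling)."""
    rc = 1.0 / (2.0 * np.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    y = np.zeros_like(x, dtype=float)
    for i in range(1, len(x)):
        y[i] = alpha * (y[i - 1] + x[i] - x[i - 1])
    return y

dt = 1e-3                                    # sampling step in seconds (assumed)
t = np.arange(0.0, 30.0, dt)
dc_shift = np.where(t >= 5.0, 1.0, 0.0)      # sustained jump starting at t = 5 s
fast = 0.2 * np.sin(2.0 * np.pi * 20.0 * t)  # fast activity riding on top
raw = dc_shift + fast
filtered = rc_high_pass(raw, dt, cutoff_hz=0.5)

tail = slice(-5000, None)                    # last 5 s of the trace
print("raw mean over last 5 s:", raw[tail].mean())
print("filtered mean over last 5 s:", filtered[tail].mean())
print("filtered peak over last 5 s:", filtered[tail].max())
```

      In the filtered trace the fast oscillation survives almost unchanged, while the sustained shift decays back to baseline within a few filter time constants, which is why a slow jump present in a DC-coupled signal would not be visible here.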

      We agree that the experimental trace in Fig. 4d shows more complex, non-periodic dynamics (e.g., slowing burst frequency and irregularity), which are not captured by our current deterministic model. These differences could plausibly arise from additional physiological processes (e.g., stochastic transitions between metastable regimes or variability in ion regulation) that are not modeled here. In future work, such phenomena may be captured by introducing noise or parameter variability (see, e.g., Saggio et al., A taxonomy of seizure dynamotypes, eLife 2020), or by allowing the parabola coefficients in the nullcline approximation to vary dynamically.

      Finally, regarding the choice of a Hodgkin–Huxley-type neuron: this model allows us to incorporate a biophysical description of ion exchange, which is central to the phenomena we study. While modeling the spiking mechanisms explicitly precludes certain mathematical simplifications available to very simplified neuron models with reset, it enables direct links between mesoscopic dynamics and measurable quantities such as extracellular potassium, an essential objective of our work. To summarize, we rearranged Fig. 4:

      Potassium can show periodic behavior, with V bursting riding on top (Fig. 4a). The model also shows this behavior at different timescales (Fig. 4b, c, e).

      The AC-coupled LFP recording is filtered, so the V jump during the bursts may not be visible (we do not have DC recordings). No claim is made about the bursting class here.

      Potassium can also show more complex behavior (e.g., slowing of burst frequency, Fig. 4d) that the deterministic model does not show; it might be captured by exploring dynamical parameters (e.g., the parabola coefficients or K_bath) or by adding noise that allows jumps between regimes (Saggio et al., eLife 2020).

      (7) Sec. 1.5 Here six neural masses are coupled via long-range structural connections with random weights. Simulations of the system are shown for two different values of the global coupling parameter (G = 0 and G = 100). How many realisations of the network have been considered?

      We thank the reviewer for pointing this out. The presented simulation was intended as a proof-of-concept demonstration to illustrate the model’s capacity to support network-level propagation of pathological activity. For this purpose, we considered a single representative realization of the structural connectivity with random weights. Given the deterministic nature of the model and the qualitative focus of the demonstration, additional realizations do not qualitatively change the observed behavior — namely, the transition from localized to network-wide bursting as coupling strength increases. We have now clarified this in the revised manuscript.

      “This simulation serves as a proof of concept to illustrate how local pathological activity can propagate through a network depending on the strength of coupling. We used a single representative realization of randomly weighted structural connectivity. While we did not perform a systematic exploration of different realizations or coupling strengths, we observed that the qualitative behavior, namely the emergence of network-wide bursting beyond a critical coupling threshold, remains robust across similar setups. The model is compatible with empirical connectome data and can be readily extended to simulations using realistic brain network architectures.”

      In future applications involving data-driven network architectures or variability analyses, we agree that exploring multiple realizations or empirical connectomes will be valuable.

      How do the results depend on the different choices of the random weights? What is the dependence of the emergent dynamics on G? What kind of dynamics can be observed varying smoothly the parameter G (e.g. from 0 to 100)?

      This section serves as a proof of concept to show that pathological activity in one node can propagate through the network when coupling is strong. We used a single random weight configuration and did not systematically explore variations in G or connectivity. While richer dynamics likely emerge across intermediate values of G, a full parameter sweep is beyond the scope of this study. We clarify this in the revised text (see answer above).

      (8) Sec. 2.1 In the description of the experiment it is mentioned that only Mg^{2+} is varied. What is the role played by Mg^{2+} variation in influencing the external potassium concentration variation? How can the experiment be linked to the model? How is the hypothesis of introducing an equation for the potassium concentration current in the microscopic model supported by the experiment, and vice versa?

      We thank the reviewer for this question. We have added a new subsection in the Methods explaining magnesium removal as a means to influence the external potassium dynamics:

      “The membrane of hippocampal neurons is equipped with N-methyl-D-aspartate type glutamate receptors (NMDARs). These receptors have a very high affinity for glutamate and can, in principle, be activated by ambient glutamate present at low concentrations in the brain extracellular fluid (ECF). Under normal physiological conditions, this activation does not occur because extracellular magnesium ions (Mg<sup>2+</sup>) block the NMDAR channel at membrane potentials more negative than about –50 mV; this voltage-dependent block prevents receptor activation at rest. When extracellular magnesium is removed, the block is relieved, allowing NMDARs to be activated, leading to neuronal depolarization toward the action potential threshold [117].”

      “In addition, as a divalent cation, Mg<sup>2+</sup> interacts with the negatively charged neuronal membrane, contributing to the stabilization of the resting membrane potential. Lowering extracellular magnesium concentration disrupts this effect, resulting in membrane depolarization [118].”

      “Consequently, magnesium removal not only facilitates NMDAR-dependent depolarization, but also directly depolarizes neurons. This depolarization increases the driving force for outward potassium currents through K<sup>+</sup> channels, meaning that variations in Mg<sup>2+</sup> can indirectly influence external potassium dynamics during neuronal activity.”

      (9) Sec. 2.6 The modified version of the continuity equation has been derived following Reference [95], where the authors consider a network of Izhikevich neurons, and each neuron is modelled by a two-dimensional system consisting of a quadratic integrate and fire equation plus an equation that implements spike frequency adaptation. In particular, in [95] the authors achieve a closed set of mean-field equations with the inclusion of the mean-field dynamics of the adaptation variable by using a Lorentzian ansatz combined with the moment closure approach. The moment closure condition is also assumed in the present manuscript (Eq. 19). Under which assumptions is the implementation of the moment closure condition justified?

      We thank the reviewer (and Reviewer #2) for questioning the justification of the assumptions used in our formalism. We agree that the moment closure alone is not a sufficient justification for assuming that V depends on the mean n, which is necessary for the derivation of Eq. 20; we additionally need the assumption that n can be treated as a collective variable, as is done in the works mentioned by Reviewer #2. We have also performed numerical simulations of the full system to calculate the error term introduced by this approximation, and the results in the new Fig. S2 show that it is below 2% in each of the dynamical regimes.

      We have hence modified the justification for Eq. (19) reading:

      “Next we assume a first-order moment closure condition for the variable n [59], justified by the numerical simulations of the full network (see Fig. S2), which show that for most of the neurons (close to 99% for the same value of ∆ as in the other simulations) the population mean captures the behavior of the single neurons well [122]. Finally, putting together these factors and assuming that n can be treated as a collective variable for each neuron (see the Limitations of the model section), we arrive at” and also

      “The validity of the first moment closure, Eqs. (19), as in [59], is supported by the numerical simulations, which show that, both, during the silent regime and when seizure-like events occur, n<sub>i</sub> for most neurons track the network averaged ⟨n | V, η⟩. In particular, it is less than 2% of the neurons that fire while the mean is low, and vice-versa, Fig. S2. In less synchronized scenarios (larger ∆ or smaller J), however, this value would increase, but the mean would always capture the qualitative behaviour of the population.”
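
      One way to quantify the kind of check described above is sketched below. This is only an illustration under stated assumptions, not the analysis behind Fig. S2: the function name `mismatch_fraction`, the binary high/low criterion, and the `threshold` value are all hypothetical, and the toy data are made up to keep the example small.

```python
def mismatch_fraction(n, threshold):
    """Fraction of (neuron, time) samples whose gating state disagrees
    with the state of the population mean at the same time step.

    n[i][t] is the gating variable of neuron i at time step t.
    A value counts as "high" when it exceeds `threshold`
    (a hypothetical cutoff chosen for illustration).
    """
    n_neurons = len(n)
    n_steps = len(n[0])
    mismatches = 0
    for t in range(n_steps):
        # Population mean of the gating variable at time step t.
        mean_t = sum(n[i][t] for i in range(n_neurons)) / n_neurons
        mean_high = mean_t > threshold
        for i in range(n_neurons):
            # Count neurons that are high while the mean is low, or vice versa.
            if (n[i][t] > threshold) != mean_high:
                mismatches += 1
    return mismatches / (n_neurons * n_steps)


# Toy data: 4 neurons, 3 time steps; one neuron lags behind at the first step.
toy = [
    [0.9, 0.1, 0.9],
    [0.9, 0.1, 0.9],
    [0.9, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
print(mismatch_fraction(toy, 0.5))
```

      A small value of this fraction would indicate that the single-neuron gating variables track the population mean, which is the condition the first-order moment closure relies on.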

      This is also now explicitly mentioned in the following paragraph:

      “Unlike the mean membrane potential ⟨V⟩ and the firing rate (r), which can be explicitly derived from the continuity equation under the Lorentzian assumption, the expression for ⟨n(t)⟩ in Eq. (26) is formal. In our mean-field model, the gating variable (n) is treated as a global population variable, evolving deterministically as a function of the average membrane potential. Therefore, ⟨n(t)⟩ corresponds to the collective gating variable assumed to be shared by all neurons, and is not computed by averaging distinct microscopic (n<sub>i</sub>) values.”

      (10) Considering also the comments reported above, I think that it would make more sense to start from an Izhikevich neuron model as microscopic model and add the equations for the ionic currents as mesoscopic variables (i.e. written as population average variables), instead of starting from the Hodgkin-Huxley single neuron model and trying to make hardly justifiable approximations and simplifications.

      We respectfully disagree. While the Izhikevich model is computationally efficient, it lacks the biophysical detail required to capture key ion-driven mechanisms such as depolarization block, slow ion accumulation, and specific burst-initiation dynamics, all of which are central to our study. The Hodgkin–Huxley framework, despite requiring approximation, provides the necessary physiological grounding to link microscopic ion exchange with emergent population behavior.

      (11) Sec. 2.7 What is the advantage of using six more parameters to fit, like R-,R+,c-,c+,I-,I+?

      This is in contradiction with the spirit of deriving a mean-field model, where the number of parameters should be reduced. What is the advantage of this mean-field derivation with respect to other mean-field derivations of Hodgkin-Huxley neurons, like the one in Reference [9]?

      The additional parameters (R±, c±, I±) are not arbitrary: they compactly parametrize the cubic-like nonlinearity of the membrane potential dynamics in our stepwise-quadratic approximation. This trade-off allows us to preserve essential biophysical features of HH neurons (e.g., bursting regimes, depolarization block) within a tractable analytic framework. Compared to alternative approaches like in ref. [9], which focus on phenomenological reductions and do not yield an ODE system, our model offers more direct interpretability in terms of ion dynamics, providing a closer link between microscopic mechanisms and mesoscopic activity patterns.
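
      To illustrate how six parameters can describe a cubic-like curve with two quadratic pieces, here is a minimal sketch. It assumes that each branch has the form c(V − R)² + I and that the branches meet at a switching voltage `v_switch`; this functional form, the switching point, and the numerical values are assumptions made for illustration and are not taken from the manuscript.

```python
def stepwise_quadratic(v, v_switch, c_minus, r_minus, i_minus,
                       c_plus, r_plus, i_plus):
    """Two-branch quadratic approximation of a cubic-like function of v.

    Below v_switch the "minus" parabola is used, above it the "plus" one.
    Each branch is c * (v - r)**2 + i (assumed form, for illustration only).
    """
    if v < v_switch:
        return c_minus * (v - r_minus) ** 2 + i_minus
    return c_plus * (v - r_plus) ** 2 + i_plus


def toy_branch(v):
    # Hypothetical parameters: an upward parabola with minimum at v = -1
    # joined continuously at v = 0 to a downward parabola with maximum at v = 1.
    return stepwise_quadratic(v, 0.0, 1.0, -1.0, 0.0, -1.0, 1.0, 2.0)


for v in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(v, toy_branch(v))
```

      With these toy values the curve falls to a local minimum, rises to a local maximum, and falls again, which is the qualitative shape a single quadratic (as in the QIF model) cannot reproduce.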

      (12) Sec. 2.11 The derivation of the mean-field dynamics for the gating variable is rather heavy and difficult to follow. This section could be simplified, whilst also better explaining the underlying approximations and the validity of these approximations, which is currently missing.

      We agree that the derivation is technical, but we chose to retain it for transparency, as it follows the Chen and Campbell approach and makes key approximations such as moment closure explicit. We have now added a clarification that n is treated as a collective variable. We hope that the current level of detail helps readers understand the assumptions underlying the gating variable dynamics.

      (13) Sec. 2.12 The derivation of Eqs. (36) is quite confusing and needs to be re-written in a clearer form. Why are both the variables x and r present in these equations, since they are proportional according to Eq. (25)?

      We thank the reviewer for pointing this out. We have adjusted the equations to improve clarity and now consistently express the firing rate in terms of a single variable. This removes the redundancy and simplifies the presentation.

      (14) Sec. 2.13 The derivation of Eqs. (37) is quite confusing and needs to be rewritten in a clearer form.

      Both the auxiliary variable x and the firing rate r are present in this equation, the same as in Eq. (36). Therefore it is presented as a set of equations for the auxiliary variable x and for the physical variable V. Moreover in the equation for dV/dt, the quadratic term in V has disappeared and it is not clear to me which are the variables corresponding to I- and I+. In particular, in Eqs. (36) there are two different current terms I-,I+ for the two equations related to dy/dt. In Eqs. (37) there is a single term (I_{cl} +I_{Na}+I_K+I_{pump})/C_m which is identical for both equations related to dV/dt. I was expecting two different terms also in Eqs. (37).

      We appreciate the reviewer’s close reading. To improve clarity, we now express the dynamics in terms of the firing rate r, replacing \dot{x} with \dot{r} in both Eq. (36) and Eq. (37) to avoid confusion.

      As for the current terms: in Eq. (37), we reverse the stepwise quadratic approximation and reintroduce the original ionic currents from Eq. (16). This is why the expressions involving I_{\text{cl}}, I_{\text{Na}}, I_K, and I_{\text{pump}} appear as a single summed term in \dot{V}, rather than the split I_-,I_+ terms used in the stepwise approximation. We now clarify this in the text.

      We also write V as \bar{V} to clarify that it refers to the average membrane potential for the neuronal population. Finally, we wrote the final equation in a more compact form to improve clarity (new Eq.38).

      (15) Moreover, while the equation for the gating variable n can be considered as a differential equation for a mesoscopic variable since n depends on average values only, it is not clear to me if the remaining variables 𝛥[K+]_{int}, [K+]_g can be considered mesoscopic or not. Since Eqs. (37) represent a mean-field model, I expect every variable to be a mean-field variable. This could be easily achievable for the extracellular potassium concentration, but I do not understand how a site-specific microscopic variable like the intracellular potassium concentration variation can be automatically inserted in a set of mean-field equations without any averaging or intermediate steps. This is a crucial point to be clarified for the validity of the neural mass equations.

      We thank the reviewer for raising this important point. In our model, we assume spatial homogeneity at the mesoscopic scale, meaning that ion concentrations, both intra- and extracellular, are uniformly distributed across the population. As a result, variables such as Δ[K+]int and [K+]g are treated as population-level averages, consistent with the mean-field framework.

      Moreover, the rate of change of intracellular potassium is tightly coupled to extracellular dynamics via ion exchange mechanisms, justifying its inclusion as a slow, mesoscopic variable. We now clarify this modeling assumption explicitly in the text.

      “By locally homogeneous, we mean that all neurons in the population are assumed to share the same extracellular and intracellular ionic environment and are connected with identical coupling rules, allowing us to treat the population as uniform with respect to ion dynamics and connectivity.”

      “These slow variables are in addition considered to be mesoscopic, meaning they are identical for every neuron in the population.”

      Minor points:

      (1) Figure 2, panel d. Please detail the variable on the y-axis, which is not reported in the figure.

      Done

      (2) Eq. (15) is cited in many parts of the manuscript, while it seems to me it would be more appropriate to reference Eq. (2). Is this a mistake or is there a reason to cite Eq. (15)?

      The reviewer is correct; the equation label was wrong, and we have now corrected it.

      (3) Figure 4 Would it be possible to show enlargements of the mean membrane potential traces to directly compare the different bursting types shown by the simulation of the different models?

      Panel d already contains an enlarged part of the membrane potential traces. For the rest, going back to Q6, we want to stress again that our intention is not to claim a quantitative match between the experimental and simulated traces.

      (4) Figure 5 In the caption the author refers to "the generic model, single neuron model, and epileptor model". Could you please better explain the models referred to and why they are mentioned? Are the generic model and the single neuron model those that are presented in the Materials and Methods section? Or do you refer to completely different models, as for the epileptor?

      We have removed the reference to the generic model (we had in mind the canonical model for seizures by Saggio et al. 2017), since it is not mentioned in the paper, and we have clarified that the single neuron model and the epileptor model were used to simulate seizure-like events.

      (5) Sec 2.5 As already stated above, the authors need to reduce unnecessary formulations that confuse the reader. Here, for example, Eqs. (6) and (7) are unnecessary, in view of the fact that delta spikes are used (Eq. 8).

      We thank the reviewer for the suggestion, but we disagree, and we think it is better to start the derivations from the more general case, as done with Eqs. 6-7.

      (6) Sec. 2.6 Could you please better explain why in Eqs. (15) and (16), the variable V0 is introduced, while before and after this, the variable V is used?

      We thank the reviewer for the comment. In Eqs. (15) and (16), \dot{V}_0 denotes the free term of the membrane potential equation, i.e., the component driven solely by the intrinsic ionic currents and excluding the synaptic input I_syn. Only this \dot{V}_0 term (a function rather than an independent variable) is approximated by the piecewise quadratic expression in Eq. (21). In contrast, V is the membrane potential variable, whose dynamics are obtained by combining \dot{V}_0 with the synaptic current contribution I_syn. In summary, there is no independent variable V_0; only the function \dot{V}_0 is introduced to represent the intrinsic (non-synaptic) component of the membrane potential dynamics. We have now clarified this in the text.

      (7) In the square brackets of the r.h.s. of Eq. (18), for all the intermediate steps, it appears G^n(V,n) ϱ^V, while there should be G^n(V,n) ϱ^n.

      We thank the reviewer for catching this typo. We have corrected this in the revised manuscript.

      (8) Sec. 2.8 Here the authors affirm that "a double-Lorentzian (or a piece-wise Lorentzian) could be a suitable form for ρ^V (t, V | η). However, it is not clear under which conditions such an assumption would allow a solution to the continuity equation". What are the problems underlying the implementation of the double Lorentzian? It seems to be a more correct form than the single Lorentzian actually implemented.

      We thank the reviewer for this thoughtful question. In principle, a double-Lorentzian ansatz for ρ^V can indeed be implemented in several reasonable ways, for example, by enforcing that the combined area of the two Lorentzian components is normalized to one (to preserve the probabilistic interpretation) and by imposing smoothness constraints at their boundaries. However, despite exploring these implementations, we were unable to obtain non-trivial solutions of the continuity equation under this parametrization. The only solvable case we found is the degenerate one in which the two Lorentzians collapse onto each other (i.e., x_- = x_+ and y_- = y_+), which reduces the ansatz to the single-Lorentzian form used in the manuscript. For this reason, although the double-Lorentzian is conceptually appealing, it did not yield practically useful solutions within our framework.
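
      The two constraints mentioned in this response can be written out for one possible piecewise-Lorentzian ansatz. The sketch below is an illustration only: the junction point V_* and the branch weights a_± are symbols introduced here, while x_± and y_± denote the half-width and centre of each Lorentzian, as in the response. The manuscript may have used a different parametrization.

```latex
% Piecewise-Lorentzian ansatz (a_\pm and V_* are assumed symbols)
\[
\rho^V(V) =
\begin{cases}
  a_- \, \dfrac{1}{\pi} \, \dfrac{x_-}{(V - y_-)^2 + x_-^2}, & V < V_* \\[2ex]
  a_+ \, \dfrac{1}{\pi} \, \dfrac{x_+}{(V - y_+)^2 + x_+^2}, & V \ge V_*
\end{cases}
\]

% Normalization: the total area under both branches equals one
\[
a_- \left[ \frac{1}{2} + \frac{1}{\pi} \arctan\frac{V_* - y_-}{x_-} \right]
+ a_+ \left[ \frac{1}{2} - \frac{1}{\pi} \arctan\frac{V_* - y_+}{x_+} \right] = 1
\]

% Continuity of the density at the junction V_*
\[
a_- \, \frac{x_-}{(V_* - y_-)^2 + x_-^2}
= a_+ \, \frac{x_+}{(V_* - y_+)^2 + x_+^2}
\]
```

      In the degenerate case x_- = x_+ and y_- = y_+, continuity gives a_- = a_+ and normalization then gives a_- = a_+ = 1, so the ansatz reduces to a single Lorentzian, in line with the only solvable case reported above.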

      (9) Eq. (28). The symbols used for the flux (especially those used in the second-to-last step once the inner integration is performed) are confusing and it is difficult to understand what they mean.

      We thank the reviewer for noting this issue. The problem was due to a LaTeX typo that prevented the vertical lines—indicating that the flux is evaluated at specific points—from rendering correctly. We have now corrected this.

      (10) Eq. (29) In the third step there are some misprints that impair comprehension.

      We thank the reviewer for noting this. We have corrected these misprints in the revised version.

      (11) Line 696. The reference is not displayed.

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      As a really general remark, this manuscript is written in a confusing manner, the authors present their model in a general formulation and their analysis in a complicated way that in the end is not needed, as I will explain in detail in the following.

      Another general question is why the authors want to employ the neural mass reduction methodology developed in [23] to obtain exact mean-field evolution for quadratic neurons (like the quadratic integrate and fire (QIF)) for a model that reveals a cubic dependence on the membrane potential, as the FitzHugh-Nagumo neuron (that indeed is a 2d reduction of the Hodgkin-Huxley model), to obtain an approximate neural mass model that somehow works qualitatively only for synchronized dynamics? Why not use another approach more suited to derive the neural mass model for cubic nonlinearity, as the one suggested in [33] and [69] by Di Volo and co-authors? What is the rationale behind the choice of the authors?

      We appreciate the reviewer’s critical feedback and the opportunity to clarify our methodological choices. Our decision to base the mean-field model on Hodgkin–Huxley-type neurons stems from the need to retain ion channel dynamics, which are essential to capture the coupling between membrane activity and extracellular ionic concentrations. This biophysical link is central to our study and cannot be achieved using more abstract neuron models such as QIF or FitzHugh-Nagumo alone.

      Regarding the mean-field reduction method: while the Ott-Antonsen/Lorentzian framework is indeed exact for QIF neurons, we adopted a stepwise quadratic approximation to apply a similar formalism to the cubic-like dynamics of the HH model. This choice enables us to analytically capture a rich set of behaviors, including bursting, depolarization block, and seizure-like dynamics, in a tractable mean-field system.

      We considered the approach of Di Volo and colleagues [33, 69], but their methodology is tailored to asynchronous irregular regimes, whereas our model is specifically designed to capture dynamics in quasi-synchronous or bursting regimes — including epileptiform activity — which are not covered by the assumptions of the Di Volo framework.

      We now clarify these modeling choices more explicitly in the revised manuscript.

      "Unlike phenomenological or reduced models, the Hodgkin–Huxley framework allows us to retain explicit ion exchange dynamics, which are essential for linking membrane behavior to extracellular potassium fluctuations. This level of biophysical detail is crucial for modeling pathological regimes such as seizure onset and propagation."

      Furthermore, the derivation of the neural mass equations is unnecessarily complicated, as a matter of fact, they approximate all the variables (except the membrane potentials of the single neurons) as collective variables (i.e. the gating variable and the potassium concentration) common to all the neurons. The neural network model for which they derive the neural mass model presents microscopic evolutions of the membrane potential cubic-like plus other global variables equal for all neurons, that depend on collective variables such as the mean membrane potential or the mean firing rate. Once clarified, the derivation of the neural mass model is much simpler, and it is not necessary to follow the approach reported in Reference [95] [Chen, L. & Campbell, S. A. Exact mean-field models for spiking neural networks with adaptation. Journal of Computational Neuroscience 50 (4), 445-469 (2022)] which is unnecessarily complicated. The authors can follow a much simpler methodology as explained by Guerreiro et al in Reference [R6] (cited below) where the authors consider the same model studied in [95]. Such a methodology has been applied in many cases already, to introduce realistic aspects in the neural mass model [23] (see References [R1-R7] below). I strongly encourage the authors to reformulate their approach in a simpler and clearer manner, by following the approach reported in [R1-R7]. The manuscript will become more readable and it will gain in comprehension.

      We thank the reviewer for this helpful suggestion. We agree that, given the assumptions made in our derivation (i.e., shared gating and ion concentration variables across neurons), the mean-field equations could alternatively be obtained using the simpler methodology proposed by Guerreiro et al. [R6] and related works [R1–R7]. However, we chose to follow the derivation presented by Chen and Campbell [95] because it makes the approximations (e.g., moment closure, flux boundary assumptions) explicit and generalizable to future extensions. We also acknowledge that the assumption that n is treated as a collective variable is needed and, for clarity, we have now added a remark in the manuscript indicating that the same result could be recovered more directly using the approach of Guerreiro et al.

      “We note that, under the assumption of globally shared gating and ion concentration variables across the neuronal population, the resulting mean-field equations can also be derived using simpler methods as proposed by Guerreiro et al. [58]. In this work, we follow the more general formalism of Chen and Campbell [59], which makes the role of key approximations (e.g., moment closure, vanishing flux at boundaries) explicit. This also facilitates potential generalizations to settings with partial heterogeneity or dynamic gating distributions.”

      “Finally, putting together these factors and assuming that n can be treated as a collective variable for each neuron”

      “Unlike the mean membrane potential ⟨V⟩ and the firing rate (r), which can be explicitly derived from the continuity equation under the Lorentzian assumption, the expression for ⟨n(t)⟩ in Eq. (26) is formal. In our mean-field model, the gating variable (n) is treated as a global population variable, evolving deterministically as a function of the average membrane potential. Therefore, ⟨n(t)⟩ corresponds to the collective gating variable assumed to be shared by all neurons, and is not computed by averaging distinct microscopic (n<sub>i</sub>) values.”

      Now I will examine in detail all the manuscript and report comments/remarks/suggestions numbered as (Q#) on how to improve the present manuscript to render it easier to read and more comprehensible, these are not minor remarks, just detailed ones.

      Introduction

      (Q1) The Introduction section needs a part devoted to the reduction methodology developed in [23] for QIF neurons and a presentation of previous works dealing with the introduction of biologically realistic aspects in the neural mass model derived in [23]. Here is a non exhaustive list of such papers concerning the introduction of the following realistic aspects in the neural mass developed in [23]:

      (I) short-term synaptic plasticity :

      [R1] Exact neural mass model for synaptic-based working memory H Taher, A Torcini, S Olmi, PLOS Computational Biology 16 (12), e1008533 (2020)

      [R2] Bursting in a next generation neural mass model with synaptic dynamics: a slow-fast approach H Taher, D Avitabile, M Desroches, Nonlinear Dynamics 108 (4), 4261-4285 (2022)

      [R3] Mean-field approximations of networks of spiking neurons with short-term synaptic plasticity R Gast, TR Knösche, H Schmidt, Physical Review E 104 (4), 044310 (2021)

      (II) spike frequency adaptation:

      [R4] Gast, Richard, Helmut Schmidt, Thomas R. Knösche. "A mean-field description of bursting dynamics in spiking neural networks with short-term adaptation." Neural computation 32.9 (2020): 1615-1634.

      [R5] Population spiking and bursting in next-generation neural masses with spike-frequency adaptation, A Ferrara, D Angulo-Garcia, A Torcini, S Olmi, Physical Review E 107 (2), 024311 (2023).

      (III) conductance-based neuron with a slow current (Izhikevich model):

      [R6] A new generation of reduction methods for networks of neurons with complex dynamic phenotypes, IC Guerreiro, M Di Volo, B Gutkin, preprint arxiv: 2206.10370 (2022)

      (IV) spike timing-dependent plasticity:

      [R7] Mean-field approximations with adaptive coupling for networks with spike-timing-dependent plasticity, B Duchet, C Bick, Á Byrne, Neural computation 35 (9), 1481-1528 (2023).

      (V) random connectivity and noise:

      [R8] Mean-field models of populations of quadratic integrate-and-fire neurons with noise on the basis of the circular cumulant approach, DS Goldobin, Chaos: An Interdisciplinary Journal of Nonlinear Science 31 (8) (2021)

      [R9] A reduction methodology for fluctuation-driven population dynamics DS Goldobin, M Di Volo, A Torcini, Phys. Rev. Lett. 127, 038301 (2021)

      [R10] Shot noise in next-generation neural mass models for finite-size networks VV Klinshov, SY Kirillov Physical Review E 106 (6), L062302 (2022)

      I think the authors should refer in the introduction to these previous papers, where realistic biological aspects have been already introduced in the neural mass model developed in [23].

      We have added a whole paragraph devoted to the next-generation neural mass models and in particular to the other works introducing biological realism in this class of models:

      “Recently, a class of these models, called next-generation neural mass models [42], has been developed based on an analytical approach introduced by [25] that allowed for the exact derivation of mean field parameters for a population of quadratic integrate-and-fire (QIF) neurons. These can be linked to EEG/MEG oscillations [43], including epileptic seizures [44], and have been used to study various aspects of the whole-brain dynamics such as the low-dimensional manifold of the resting state [45, 46], aging [47] and neural signatures of consciousness [48]. A number of works dealt with the introduction of biologically realistic aspects in the mostly phenomenological neural mass model derived in [25]. These included short-term synaptic plasticity [49–51], spike frequency adaptation [52, 53], spike timing-dependent plasticity [54], synaptic delay [29], random connectivity and noise [55–57], as well as an extension of the conductance-based neurons with a recovery variable [58–60].”

      (Q2) Line 117 - Please specify what you mean by locally homogeneous, here.

      Thank you for allowing us the opportunity to clarify this. We now report:

      "By locally homogeneous, we mean that all neurons in the population are assumed to share the same extracellular and intracellular ionic environment and are connected with identical coupling rules, allowing us to treat the population as uniform with respect to ion dynamics and connectivity."

      (Q3) In this sub-section the authors should clarify all the hypotheses they employ to derive the neural mass models, not only the Lorentzian approximation they did for a cubic model, but also the fact that they assume that the gating variable n is a global variable as well as that the potassium concentration are assumed to be the same for all neurons, that they assume no heterogeneity at this level. This is a fundamental aspect that should be clarified at this stage already.

      We thank the reviewer for this important observation. We agree and have revised the text in the derivation section to explicitly state all key assumptions. Specifically, we now clarify that:

      (1) The gating variable n is treated as a population-average (global) variable;

      (2) The potassium concentrations Δ[K+]int and [K+]g are assumed to be homogeneous across the neuronal population; and

      (3) No heterogeneity is assumed at the level of the ion dynamics.

      This assumption is biophysically motivated: ion concentrations — particularly extracellular potassium — tend to redistribute rapidly due to diffusion and electrochemical forces, leading to an effectively well-mixed environment at the mesoscopic scale. As such, assigning separate compartments to individual neurons is not justified in this modeling context. We now explicitly note this in the manuscript to avoid ambiguity.

      “3) We assume that the potassium concentrations, both intracellular (Δ[K+]int) and extracellular (through the buffering variable [K+]g), are homogeneous across the neuronal population. This is justified physiologically by the rapid redistribution of ions through diffusion and electrochemical gradients, which enforce near-instantaneous equilibration at the mesoscopic scale. As such, assigning separate compartments to each neuron is neither practical nor biologically meaningful in this context; 4) We assume that the gating variable n, which governs potassium conductance, can be treated as a population-averaged variable. This allows us to describe the neuronal ensemble using a reduced set of collective (mean-field) variables.”

      Comparison with neural network simulations

      (Q4) The comparison the authors perform between the microscopic model and the neural mass is misleading. From what the authors wrote it seems that you are considering 4 variables for each neuron in the network model (this is unclear from how the model is written in Eq. (9)): I guess one for the membrane potential, one for the gating variable and two for the potassium concentration. However, this is not the network model for which the neural mass has been developed. The neural mass has been obtained for a network made of N + 3 variables (N membrane potentials and 3 collective variables for the gate and the potassium concentrations); this is a sort of mesoscopic network model, analogous to what was done previously in references [R1,R3,R4] above and others. If the authors compared their neural mass with this mesoscopic model, the agreement between the two would improve.

      We agree with the reviewer’s observation and we now acknowledge this issue in the Results and in the Limitations. We have already modified the text to explicitly state that for the mean-field derivations n is treated as a collective variable and we have added the following statements:

      “Also note that the gating variable n is treated as microscopic in the neural network, while in the derivations for the mean-field it is considered mesoscopic and identical for the whole population. This is likely responsible for some of the discrepancies between the two modalities.”

      “Moreover, the discrepancy between the two modalities would have likely been smaller if for the neural network we also adopted a gating variable that is mesoscopic and identical across the spiking neurons, as in similar works [49–51]. However, here we demonstrate the validity of the mean-field approximation even for the more natural, microscopic representation of the gating variable in the neural network.”

      Comparison with in vitro experiments

      (Q5) Experiment -- The experiment is performed in vitro on the intact Hippocampus of mice between postnatal days P5-P7. It is known [R1] that neuronal activity at an early developmental stage is provided in the Hippocampus by a network primarily driven by synchronized GABA_A that provides an excitatory action and generates giant depolarizing potentials (GDPs) [R11]. However, GDPs have frequencies in the range of 1 Hz - 0.1 Hz, not matching the oscillation frequencies reported by the authors. I have several questions here:

      (E1) At this stage (P5-P7), are the interactions among neurons essentially excitatory? If not, please explain why. Are the oscillations reported by the authors somehow related to GDPs? The depolarizing action of GABAergic transmission and the presence of GDPs during early rodent brain development, as described by Ben-Ari and some other researchers, are characteristics commonly observed in ex vivo brain preparations, but are not evident under physiological in vivo conditions (see doi: 10.3389/fphar.2012.00065).

      In our preparation—intact mouse hippocampus—GABAergic synaptic transmission is not depolarizing. This is evidenced by the fact that inhibition of ionotropic GABA_A receptors with bicuculline triggers interictal-like discharges, which are routinely used as a model of epileptiform activity (see doi: 10.1016/j.nbd.2014.12.013). Therefore, in our experiments at P5–P7, neuronal interactions are not purely excitatory, and the observed low Mg2+ induced oscillations are not related to GDP.

      (E2) What is the nature of the oscillations reported by the authors in Figure 4? What is their origin? Please explain this clearly in the text of the paper.

      The model of epileptic discharges presented in our study was first introduced over 20 years ago and has since become a well-established paradigm for screening potential antiepileptic drugs and research on the mechanism of epileptic seizure. A detailed description of this model can be found in doi: 10.1046/j.1460-9568.2002.02143.x, and its pharmacological properties are reviewed in doi: 10.1046/j.1528-1157.2003.19503.x. These references have now been added to the manuscript for clarity.

      We have added the following:

      “The model of epileptic discharges presented in our study was first introduced over 20 years ago [115] and has since become a well-established paradigm for screening potential antiepileptic drugs and research on the mechanism of epileptic seizure [116].”

      (E3) How exactly does the concentration of extracellular potassium ions change, this is not clear even in Methods, please clarify.

      [R11] Excitatory actions of GABA during development: the nature of the nurture Y Ben-Ari, Nature Reviews Neuroscience 3 (9), 728-739 (2002).

      We have now added a new subsection in the Methods explaining how we use Mg2+ variation to influence the external potassium variation.

      “The membrane of hippocampal neurons is equipped with N-methyl-D-aspartate type glutamate receptors (NMDARs). These receptors have a very high affinity for glutamate and can, in principle, be activated by ambient glutamate present at low concentrations in the brain extracellular fluid (ECF). Under normal physiological conditions, this activation does not occur because extracellular magnesium ions (Mg<sup>2+</sup>) block the NMDAR channel at membrane potentials more negative than about –50 mV; this voltage-dependent block prevents receptor activation at rest. When extracellular magnesium is removed, the block is relieved, allowing NMDARs to be activated, leading to neuronal depolarization toward the action potential threshold [117]. In addition, as a divalent cation, Mg<sup>2+</sup> interacts with the negatively charged neuronal membrane, contributing to the stabilization of the resting membrane potential. Lowering extracellular magnesium concentration disrupts this effect, resulting in membrane depolarization [118].”

      “Consequently, magnesium removal not only facilitates NMDAR-dependent depolarization, but also directly depolarizes neurons. This depolarization increases the driving force for outward potassium currents through K<sup>+</sup> channels, meaning that variations in Mg<sup>2+</sup> can indirectly influence external potassium dynamics during neuronal activity.”
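
      The voltage dependence of the Mg<sup>2+</sup> block described above can be illustrated with a short numerical sketch. It assumes the widely used phenomenological expression of Jahr and Stevens for the unblocked fraction of NMDAR channels; this expression and its constants (3.57 mM, 0.062 per mV) are not taken from the manuscript, and the 1 mM baseline concentration is likewise an illustrative assumption.

```python
import math

def nmdar_unblocked_fraction(v_mv: float, mg_mm: float) -> float:
    """Fraction of NMDAR channels not blocked by Mg2+ at membrane potential v_mv (mV).

    Assumed phenomenological form (not from the manuscript):
        1 / (1 + ([Mg2+] / 3.57 mM) * exp(-0.062 * V))
    """
    return 1.0 / (1.0 + (mg_mm / 3.57) * math.exp(-0.062 * v_mv))

# Compare an assumed baseline of 1 mM Mg2+ with a Mg2+-free medium.
for v in (-70.0, -50.0, -30.0, 0.0):
    with_mg = nmdar_unblocked_fraction(v, 1.0)
    without_mg = nmdar_unblocked_fraction(v, 0.0)
    print(f"V = {v:6.1f} mV  unblocked with Mg: {with_mg:.3f}  without Mg: {without_mg:.3f}")
```

      In this sketch the block is strong near rest and vanishes entirely when the magnesium concentration is set to zero, which is the qualitative effect the added Methods text relies on.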

      (Q6) Lines 187-191 and Figure 4 -- The authors wrote: "In Figure 4.c we show the membrane potential and external potassium for a simulation of N = 3000 coupled HH-like neurons showing a similar behavior, although the parameters were modified to simulate shorter fluctuations for computational efficiency." This sentence is unclear. What is clear from Figure 4 is that the network simulations gave rise to collective oscillations on a completely different time scale (seconds rather than minutes), and the profile of the potassium concentration also has a clearly different evolution. From Figure 4 one can conclude that network simulations have nothing to do with the neural mass evolution and the experiment. I think the authors should better clarify and describe the results reported in Figure 4.

      We thank the reviewer for the observation. We have revised the relevant section of the manuscript to clarify the interpretation of Figure 4 and avoid any implication of quantitative matching. As stated in our response to Reviewer 1 (comment 6), the comparison is intended to highlight the shared qualitative structure across experimental data, the neural mass model, and the network simulation — specifically, the modulation of fast bursting by slow extracellular potassium fluctuations. The difference in timescale in the network simulation arises from rescaled parameters used for computational efficiency. We now explicitly state this and have updated the figure caption and accompanying text accordingly to reflect these points.

      (Q7) Why do the authors consider a purely excitatory network to describe the experimental results? What is the reason for this choice? Why do they not consider, as usual, balanced excitatory-inhibitory networks? Please clarify this point.

      We thank the reviewer for raising this point. We chose to model a purely excitatory network as a first step in isolating the role of extracellular potassium dynamics in generating population-level bursting. This allows us to focus on the ion-driven modulation mechanisms without introducing additional complexity from inhibitory feedback. Similar modeling choices have been made in previous studies of bursting and seizure-like dynamics (e.g., Gutkin et al.), where inhibition is omitted to emphasize intrinsic or modulatory mechanisms. We acknowledge that incorporating inhibitory populations is an important next step for capturing a broader range of dynamics, but for the current study, the excitatory-only network provides a minimal and interpretable framework aligned with our focus.

      (Q8) Comparing Figures 4 (a) and (b), the bursting activity observed in the experiment and in the mean-field simulations seems quite different, originating from different mechanisms and bifurcations. Can the authors comment on this?

      We thank the reviewer for this important observation. We have reorganized the presentation of Figure 4 and revised the accompanying text to better clarify the nature of the comparison (see also our response to Reviewer 1, point 6). Our aim is not to claim that the experimental and simulated bursts arise from identical bifurcation mechanisms, but rather to highlight shared qualitative features — in particular, slow modulation of population activity by extracellular potassium. We now also comment on the potential role of more complex or noise-driven bifurcations (see Saggio et al. 2020) in shaping experimental bursting dynamics, which are not fully captured by the current deterministic model.

      Bifurcation analysis: emergent network states and multistability

      (Q9) This sub-section will gain interest by reporting simulations of the network and of the neural mass model presenting bistable dynamics.

      We agree with the reviewer that this would be an important addition, but we believe that it goes beyond the scope of this work (for computational reasons, among others) and we leave it for future work. We have, however, updated the bifurcation analysis section.

      Limitations of the model

      (Q10) Lines 276-280 -- I think that the parameters c_+, c_-, R_+, R_- depend not only on the slow variables (the potassium concentrations) but also on the actual value of the gate variable n. This should be stressed.

      We thank the reviewer for this helpful observation. We agree and have clarified this in the revised manuscript. This reflects the mean-field assumption that n is treated as a collective variable, and we now make this dependency explicit in the text.

      “Furthermore, the parabola coefficients c_-, c_+, R_-, R_+ were fixed as constants; however, these coefficients could be made functions of the slow variables and the gating variable, which might unveil new dynamical regimes and extend the validity of the thermodynamic limit beyond the regimes described in this work. Also, in the case of constant values, an in-depth exploration of the parameter space is required to fully characterize the model and its bifurcation structure.”

      (Q11) The authors wrote: "Other limiting assumptions are the moment closure condition (19) and the assumptions that the functions (3) averaged across the neuronal population can be expressed as functions of the average membrane potential V and gating variable n (which is only true in the cases where the functions (3) can be reasonably approximated as linear functions in a range of V and n." Apart from the missing parenthesis, I think that this last aspect has already been taken into account when performing the fit with 2 parabolas to the sum of the currents, or not? In case, please specify.

      We thank the reviewer for catching the missing parenthesis; this has been corrected in the revised manuscript. Regarding the modeling point: the two-parabola fit applies specifically to the membrane potential dynamics and captures the nonlinear dependence of the total current on V (Eq. 16). In contrast, the moment closure assumption involves approximating averages of nonlinear functions of both V and n, such as those appearing in the gating dynamics (e.g., n∞(V)). This is not directly accounted for by the parabola approximation, but is handled separately via the mean-field approximation of G^n as a function of the average variables (Eq. 15).
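
      The closure issue discussed here can be made concrete with a small numerical sketch: for a nonlinear function f, the population average of f(V_i) generally differs from f evaluated at the average potential, and the gap grows with the spread of the V_i. The sigmoidal form of n_inf and its parameters (v_half, slope) are hypothetical stand-ins chosen for illustration and are not the functions used in the manuscript.

```python
import math

def n_inf(v_mv, v_half=-30.0, slope=10.0):
    """Hypothetical sigmoidal steady-state activation (illustrative parameters only)."""
    return 1.0 / (1.0 + math.exp(-(v_mv - v_half) / slope))

def closure_error(voltages):
    """Return <n_inf(V_i)> - n_inf(<V>) for a list of membrane potentials in mV."""
    mean_v = sum(voltages) / len(voltages)
    mean_of_f = sum(n_inf(v) for v in voltages) / len(voltages)
    return mean_of_f - n_inf(mean_v)

narrow = [-51.0, -50.0, -49.0]   # potentials tightly clustered: f is nearly linear here
wide = [-90.0, -50.0, -10.0]     # broad spread: the nonlinearity of f matters

print("closure error, narrow spread:", closure_error(narrow))
print("closure error, wide spread:  ", closure_error(wide))
```

      The error is negligible when the potentials are tightly clustered, where a linear approximation of the function is adequate, and becomes substantial for a broad distribution, which is the regime in which the closure should be treated as a limiting assumption.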

      (Q12) A limitation that should be stressed is that the authors in the neural mass model consider the gate variable and the potassium concentrations as global variables, equal for all neurons, and where n depends on the mean membrane potential. To write that the moment closure (19) is a limiting assumption is honestly too clear, please be explicit here.

      We have now added the following two statements:

      “These slow variables are in addition considered to be mesoscopic, meaning they are identical for every neuron in the population.”

      “In our mean-field model, the gating variable (n) is treated as a global population variable, evolving deterministically as a function of the average membrane potential. Therefore, ⟨n(t)⟩ corresponds to the collective gating variable assumed to be shared by all neurons, and is not computed by averaging distinct microscopic (n<sub>i</sub>) values.”

      Discussion

      (Q13) The authors could discuss in this section the further biological ingredients they can introduce in their neural mass based on the previous works [R1-R9] that have already shown how to include plastic synapses, random connectivity, noise, adaptation, spike-timing-dependent plasticity, etc and which of these ingredients they consider more relevant for the whole brain dynamics.

      In order not to repeat the same statements from the Introduction, we have now added the following sentence:

      “This approach, taking into account key biophysical details, offers a first step in considering the role of the glia in neural tissue excitability. Following this direction, other ions, such as calcium, should be taken into consideration, as well as other effects such as plastic synapses, random connectivity, noise, adaptation, and spike-timing-dependent plasticity, as already discussed in the Introduction.”

      (Q14) The authors should also discuss why they limited their analysis to purely excitatory networks, and what would change by including excitatory-inhibitory interactions in each single mass and across neural masses, if this makes sense or not.

      As stated in our response to Q7, we chose to focus on purely excitatory networks as a first step to isolate and study the core role of extracellular potassium dynamics in driving bursting behavior. This modeling choice allows for a minimal system where the interaction between intrinsic ionic mechanisms and network coupling is most transparent.

      We also note that excitatory and inhibitory effects can be modeled within the same formalism by adjusting the synaptic reversal potential, for example $E_{syn}=0$ mV for excitatory and $E_{syn}=-80$ mV for inhibitory interactions. Including inhibitory populations would introduce additional complexity and richer dynamical regimes (e.g., oscillatory instabilities, balanced states), which are certainly of interest but beyond the scope of this study.
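
      As a minimal sketch of this point, the sign of a conductance-based synaptic current depends only on where the reversal potential sits relative to the membrane potential. The function below assumes the generic form g_syn * s * (E_syn - V) with positive values depolarizing; the sign convention, the conductance g_syn and the activation value s are illustrative assumptions, not the exact expression of the manuscript.

```python
def synaptic_current(v_mv, s, g_syn=1.0, e_syn_mv=0.0):
    """Generic conductance-based synaptic current g_syn * s * (E_syn - V).

    Positive values depolarize the neuron (assumed sign convention).
    """
    return g_syn * s * (e_syn_mv - v_mv)

v_rest = -60.0  # illustrative membrane potential in mV
excitatory = synaptic_current(v_rest, s=0.5, e_syn_mv=0.0)
inhibitory = synaptic_current(v_rest, s=0.5, e_syn_mv=-80.0)

print("excitatory drive:", excitatory)   # E_syn above V: depolarizing
print("inhibitory drive:", inhibitory)   # E_syn below V: hyperpolarizing
```

      The same function covers both cases, so switching a population from excitatory to inhibitory amounts to changing a single parameter.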

      Materials and Methods

      (Q15) Fig.2 - I think a plus is lost in panel (c) where it should be [K+bath];

      Thank you. We corrected the figure.

      (Q16) Caption of Figure 2- the authors wrote: "In the case where the derivative of the membrane potential is zero for V > V ⋆ (e.g., if the cubic function is shifted up by adding a constant current to the membrane potential derivative), the population is described by the red distribution in the steady state, and the continuity equation is governed by the negative parabola equation." This sentence is unclear, the authors mean in the case where the derivative of the membrane potential crosses zero at V > V*? Please clarify.

      We thank the reviewer for pointing this out. Yes, we refer to the case where the membrane potential derivative crosses zero at a point V > V⋆. We have clarified this in the revised figure caption.

      (Q17) Lines 558-562 -- Eqs (6) and (7) are examples of the unnecessary complications of which this manuscript is full. Since the authors do not consider any synaptic dynamics and use homogeneous (equal) couplings, these equations are not needed. I strongly recommend removing Eqs (6) and (7) and limiting the presentation to the expression reported in Eq (8), which indeed should also be corrected (see next remark).

      We appreciate the reviewer’s concern regarding clarity. As mentioned in our response to Reviewer 1, the inclusion of Eqs. (6) and (7) was intentional and serves a pedagogical purpose — to present the general structure of the network interactions before introducing simplifying assumptions. While we agree that Eq. (8) suffices for the simulations considered in this manuscript, we believe that showing the more general form helps clarify the model’s extensibility, for instance to cases with heterogeneous coupling or synaptic dynamics.

      (Q18) Eq (8) - line 562 - Since the authors assume no synaptic evolution, i.e. instantaneous post-synaptic potentials, they can clarify that Eq (8) represents the population firing rate that later will be one of the fundamental variables of the neural mass model and call it r, as in the following. Furthermore, $s_i$ does not depend on the neuron index $i$: in a fully coupled network with homogeneous coupling, as in the present case, this quantity is the same for all neurons. Please drop the index and call it r since it is the population firing rate.

      We thank the reviewer for this useful suggestion. We now clarify in the text that under the assumptions of all-to-all homogeneous coupling and no synaptic dynamics, s_i is identical for all neurons and can be interpreted as the population firing rate r. This connection is made explicit in the revised manuscript.

      “Under the assumption of instantaneous synaptic transmission and homogeneous all-to-all coupling, the synaptic activation variable s<sub>i</sub> is the same for all neurons and corresponds to the population firing rate, which we denote by r.”
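
      One way to picture this statement is to compute a population firing rate directly from spike times: with all-to-all homogeneous coupling every neuron receives the same pooled input, so a single number per time window suffices. The sketch below uses a hypothetical helper, population_rate, and made-up spike times; it is not the estimator used in the manuscript.

```python
def population_rate(spike_times_per_neuron, t_start, t_stop):
    """Mean firing rate per neuron (spikes per unit time) in the window [t_start, t_stop)."""
    n_neurons = len(spike_times_per_neuron)
    n_spikes = sum(
        1
        for spikes in spike_times_per_neuron
        for t in spikes
        if t_start <= t < t_stop
    )
    return n_spikes / (n_neurons * (t_stop - t_start))

# Made-up spike times (in seconds) for three neurons.
spikes = [[0.1, 0.4], [0.2], [0.3, 0.35, 0.9]]
r = population_rate(spikes, 0.0, 0.5)
print("population rate in [0, 0.5):", r)
```

      Because the result does not carry a neuron index, it plays the role that the reviewer asks to be denoted by r.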

      (Q19) Lines 564-567 - Here the network model is incomplete: it is not sufficient that the authors report the evolution equation for the membrane potential, Eq (9). They should report the evolution equations for the gate variable n and for the potassium concentrations, as done in Eq (1). This request is fundamental because it is unclear from the present formulation which variables are microscopic (associated with the single neuron evolution) and which are global (common to all the neurons). This is a fundamental aspect and it should be clarified. I guess that n will depend on the neuron index $i$, while it is unclear whether the authors treat the potassium concentrations as global or local. I guess that the internal density should depend on the neuron index $i$, or not? Anyway, I would like to know exactly which network model has been simulated, e.g. to obtain the results reported in Figure 3.

      We thank the reviewer for this essential clarification request. In the revised manuscript, we now explicitly state the full network model, including the evolution equations for the gating variable $n_i$ and potassium variables. While in some simulations we consider the full microscopic model involving 4N variables (where each neuron has its own $V_i$, $n_i$, $\Delta[K^+]_{int,i}$, $[K^+]_{g,i}$), for the mean-field reduction and mesoscopic comparisons we assume that the gating and potassium variables are shared across neurons. This assumption is consistent with prior work (e.g., Chen & Campbell) and is biophysically justified in the case of potassium due to its fast spatial equilibration in extracellular space. We also now mention this explicitly in the Limitations.

      (Q20) Continuity equation - Lines 568 - 597 - This part can be largely simplified and rewritten. As a matter of fact, since the authors consider the gate variable n and the potassium concentrations as global (collective) variables depending on the mean-field value <V>, they can directly start from Eq (20) by stating that they assume that the other variables (n, $\Delta[K^+]_{int}$, $[K^+]_g$) are collective variables, common to all the neurons, that depend only on mean-field variables such as <V> or r. This has been done in many previous cases, since the Ott-Antonsen Ansatz can be applied whenever the potential evolution is driven by quadratic terms and in the presence of mean-field variables; the first indication of this was reported in 1993 by Watanabe and Strogatz for phase oscillators:

      [R12] Watanabe, Shinya, and Steven H. Strogatz. "Integrability of a globally coupled oscillator array." Physical review letters 70.16 (1993): 2391.

      Anyway, this approach has been previously employed to derive a neural mass model for networks of QIF neurons in the presence of various further neuronal variables (ranging from slow currents to plastic evolution of the couplings) describing more biologically realistic situations; see references [R1-R7] above. I strongly encourage the authors to reformulate their approach in a simpler and clearer manner. Particularly relevant for them is the article [R6] by Guerriero et al., in which the authors examine exactly the same model as in Ref [95] [Chen, L. & Campbell, S. A. Exact mean-field models for spiking neural networks with adaptation. Journal of Computational Neuroscience 50 (4), 445-469 (2022)]. However, they solve the problem in a much simpler way, and I encourage the authors to follow this approach.

      We thank the reviewer for the constructive suggestion. We acknowledge that, under the assumption that n, Δ[K+]int, and [K+]g are collective variables shared across the neuronal population, one could directly begin from Eq. (20) and proceed using the simpler approaches found in Guerriero et al. [R6] or related works [R1–R7]. However, we chose to retain the Chen & Campbell formalism, with additional clarification regarding the mesoscopic nature of the gating variable, as it explicitly highlights the key approximations used in the derivation, which may be beneficial for readers seeking to extend the method. See also the general response to Reviewer 2 at the beginning.

      (Q21) Eq (26) -- I do not think the authors can estimate explicitly <n(t)> from the equation (26), as they do for the mean membrane potential and the firing rate. This is just a formal expression representing a collective variable, I do not think that <n> will coincide with the average of the values of n_i for each neuron. Please discuss this point, and in this case show that <n> indeed coincides with the average of all of the values of the single neuron gate variable n_i.

      We thank the reviewer for raising this important point. We agree that Eq. (26) is more formal than operational, as ⟨n(t)⟩ is not directly derived from the continuity equation in the same way as ⟨V⟩ or the firing rate r. Rather, it reflects our mean-field assumption that the gating variable evolves as a collective population-averaged quantity, governed by the dynamics of the average membrane potential. In our formulation, n is treated as a global variable shared across neurons, and thus ⟨n(t)⟩ effectively is the gating variable in the neural mass model — rather than the result of averaging heterogeneous n_i. We have clarified this distinction in the text to avoid suggesting that Eq. (26) provides an explicit estimate of microscopic gating dynamics.

      “Unlike the mean membrane potential ⟨V⟩ and the firing rate (r), which can be explicitly derived from the continuity equation under the Lorentzian assumption, the expression for ⟨n(t)⟩ in Eq. (26) is formal. In our mean-field model, the gating variable (n) is treated as a global population variable, evolving deterministically as a function of the average membrane potential. Therefore ⟨n(t)⟩ corresponds to the collective gating variable assumed to be shared by all neurons, and is not computed by averaging distinct microscopic (n<sub>i</sub>) values.”
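
      For readers unfamiliar with the Lorentzian assumption mentioned in this passage, the standard case for quadratic integrate-and-fire neurons (as in Montbrió, Pazó and Roxin, 2015) is sketched below. It is given only as background: the manuscript's own expressions carry additional constants (the reviewer mentions a factor R/π relating r and x) and are not reproduced here, so the symbols x and y should be read as the generic half-width and center of the distribution.

```latex
\rho(V \mid \eta, t) = \frac{1}{\pi}\,
  \frac{x(\eta,t)}{\bigl[V - y(\eta,t)\bigr]^{2} + x(\eta,t)^{2}},
\qquad
r(\eta,t) = \frac{x(\eta,t)}{\pi},
\qquad
\langle V \rangle(\eta,t) = y(\eta,t).
```

      In this standard setting the firing rate and mean potential follow directly from the two parameters of the distribution, whereas no analogous closed-form expression exists for the gating variable, which is why it is introduced as a collective variable instead.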

      (Q22) Mean-field dynamics for the gating variable - All this sub-section is in my opinion not useful, if the authors assume from the beginning that <n(t)> is a global variable. Indeed in the end they write for <n(t)> the evolution equation Eq (30) which is the same equation as for the single neuron gate variable (1) but for the mean values of n and <V>. I suggest removing this sub-section.

      We thank the reviewer for this suggestion. We agree that, under the assumption that n is a global collective variable, the resulting equation for ⟨n(t)⟩ is equivalent in form to the single-neuron gating equation, driven by the average membrane potential. However, we chose to retain this subsection to explicitly demonstrate how the gating dynamics enter into the mean-field formulation, especially for readers less familiar with this type of reduction. This step also mirrors the structure of the derivation used for other state variables in the model and maintains clarity for potential extensions where n may not be strictly global.

      (Q23) Line 696 - here an equation reference is lost.

      Thank you for pointing this out. We have corrected the text and restored the missing equation reference in the revised manuscript.

      (Q24) Eqs (36)-(37) -- Since the variables r and x entering Eq (36) are essentially the same as in Eq (25), apart from a constant R/pi, the use of two different names needlessly complicates an already complicated expression. Please decide to use either r or x everywhere and then proceed consistently; this applies also to Eq (37). This will also allow us to rewrite the equation in x or r in a more compact form.

      As noted in our response to Reviewer 1, point 14, we have revised Eq. (37) to ensure consistency in notation by replacing x with r throughout.

      (Q25) Eq (37) - This equation is not written carefully enough. First, the authors have now passed from (x,y) to (pi*r/R,V), so they should substitute x with r everywhere. Furthermore, the equation for the derivative of V is confusing: the authors should use the same approximate expression employed in Eq (36), which makes explicit the quadratic dependence on V itself; otherwise, I believe that the equation is incorrect.

      In the same response to Reviewer 1, point 14, we also clarified the expression for \dot{V} in Eq. (37): we reintroduced the full current-based formulation (as in Eq. 16), reversing the quadratic approximation used earlier. This is now explicitly stated in the text, and we have improved the equation presentation to avoid confusion.

      (Q26) Eq (37) below line 708 - From this expression, it is clear that the gate variable n and the potassium variables are ruled exactly by the same equations as for the single neuron, Eq (1), and that the Lorentzian Ansatz enters only in the rewriting of the evolution of the membrane potentials of the neurons in the network. In the end, the authors are making exactly the same approximation made by many other authors [R1-R7], namely that these variables are collective, i.e. they are the same for all neurons, and in particular n=n(V) is a function of the mean membrane potential V. The mean-field model that the authors derive corresponds to a microscopic model where the single neurons are heterogeneous only in the intrinsic currents $\eta_i$, but they are all driven by collective variables, like n(V) and the potassium variables, that are identical for all neurons. This should be clarified.

      We agree with the reviewer's conclusion and, as seen in the previous responses, we now explicitly acknowledge that n and the two slow variables are treated as mesoscopic variables in the mean-field derivation, while for the spiking network n remains microscopic.

    1. eLife Assessment

      This work by Pyne and Pandey et al. addresses DNase X (DNase1L1) activity at the macrophage phagocytic cup, using an innovative imaging approach that couples visualization of cup formation with spatially resolved detection of DNA degradation. The methodology is technically sound, and the central finding that DNA digestion begins prior to phagolysosomal maturation is considered well supported, though some mechanistic claims may benefit from further evidence and more cautious framing. Overall, the study is solid and provides a valuable framework for investigating early events at the phagocytic cup that may shape responses to pathogens and inflammatory disease.

    2. Reviewer #1 (Public review):

      Pyne and Pandey et al. report the observation of early DNA degradation at the phagocytic cup during macrophage engulfment. Using an elegant experimental system that combines actin staining to visualise cup formation with direct monitoring of DNA degradation, the authors identify rapid recruitment of the membrane-bound nuclease DNase X (DNase1L1) to nascent phagocytic cups. This recruitment occurs within minutes of cup formation, is independent of DNA presence at the substrate, and appears to originate from intracellular membrane structures rather than from the extracellular environment. The results support the conclusion that DNase X activity is present at the phagocytic cup and that DNA digestion can begin prior to phagolysosomal maturation.

      The study is technically strong. The experimental system is clean, specific, and allows precise spatial and temporal detection of DNA degradation. The imaging-based approaches are carefully executed and enable convincing visualisation of DNase X recruitment and activity. The use of an alternative substrate beyond the primary SNS system strengthens the core observation, and the data broadly support the authors' central claim.

      However, several limitations temper the physiological interpretation. The system relies largely on short, free DNA substrates, leaving open how efficiently DNase X processes more complex or physiologically relevant DNA structures, such as nucleosome-bound DNA or neutrophil extracellular traps (NETs). It remains unclear whether DNase X deficiency would alter macrophage responses to larger nucleic acid structures, influence engulfment efficiency, or modify downstream inflammatory signalling pathways such as TLR9 or STING activation. Moreover, the experimental setup prevents full phagocytic cup closure, potentially prolonging DNase activity compared with physiological phagocytosis, which typically proceeds rapidly to cargo internalisation. For example, the peak signal observed in Figure 5 occurs approximately 90 minutes after phagocytic cup formation, a time point at which many phagocytic cups would be expected to have already closed under physiological conditions. Additional work using fully engulfed cargo in more physiological contexts would clarify whether early DNase X activity meaningfully contributes to overall DNA clearance kinetics.

      Mechanistically, the signal that triggers DNase X recruitment remains unresolved. Although actin rearrangement was excluded as the primary driver, the upstream cues that direct DNase X-containing membrane structures to the forming cup are not yet defined.

      In the broader context, early DNase X activity at the phagocytic cup could represent an additional safeguard against inflammatory signalling by limiting extracellular or surface-associated DNA before phagolysosomal degradation by DNase II. This mechanism may be particularly relevant in settings where DNA fragmentation before engulfment is incomplete, such as necroptosis or NET formation. Determining whether DNase X deficiency exacerbates inflammatory responses, alters DNA clearance efficiency in vivo, or contributes to immune pathology will be critical for establishing its physiological and disease relevance.

      Overall, this is a compelling study that introduces a novel concept of pre-phagolysosomal DNA digestion. The conclusions are well supported within the in vitro system used, but further investigation using diverse DNA substrates and physiologically relevant models will be required to fully define the impact of this mechanism on immune regulation and disease.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript presents an elegant and innovative imaging approach to visualize DNase activity at the interface between macrophages and extracellular substrates. The platform is technically strong and enables the study of localized DNA degradation with high spatial resolution. The work is of clear interest and provides a useful framework to investigate how immune cells process extracellular DNA. However, several aspects of the mechanistic interpretation and conceptual framing would benefit from clarification.

      Strengths:

      (1) The study introduces a creative and well-designed imaging platform that allows visualization of localized DNase activity at cell-substrate interfaces.

      (2) The approach is technically robust and represents a valuable tool that could be broadly useful to the field.

      (3) The experiments are thoughtfully designed and address an important question regarding how immune cells interact with extracellular DNA.

      (4) The work opens interesting avenues for studying DNA processing in contexts such as infection and inflammation.

      Weaknesses:

      While the experimental approach is strong, several key conclusions rely on interpretations that would benefit from further clarification:

      (1) First, the conclusion that DNaseX is recruited to phagocytic cups from the "cytoplasm" appears conceptually imprecise. Given that DNaseX is a membrane-anchored protein, it is unlikely to exist as a freely soluble cytoplasmic pool. A more plausible interpretation is that DNaseX is supplied from intracellular membrane compartments. This interpretation would also be more consistent with the data showing dependence on a membrane anchor.

      (2) Second, the interpretation that actin polymerization is not required for DNaseX recruitment raises concerns. Phagocytic cup formation is known to depend strongly on actin dynamics, and it is therefore unclear whether the structures observed under actin inhibition represent fully formed functional cups or partial cell-substrate contacts. This distinction is important for interpreting recruitment versus activity, particularly since enzymatic activity is reduced under these conditions.

      (3) Third, the identification of DNaseX as the main nuclease responsible for the observed activity is not fully resolved. The conclusions rely primarily on gene silencing and staining approaches, but the specificity of these strategies relative to other nucleases is not addressed. It therefore remains possible that additional enzymes contribute to the observed activity.

      (4) Finally, the interpretation of the biofilm experiments may be overstated. While the data clearly show localized DNA degradation in contact with macrophages, it is not fully established that this process depends specifically on phagocytic cup structures. An alternative explanation is that membrane-associated DNase activity more generally mediates this effect. In addition, the physiological relevance of this mechanism would benefit from further discussion.

      Overall, the study is technically strong and introduces a valuable methodology, but several central conclusions are only partially supported by the current data and would benefit from more cautious interpretation and clearer conceptual framing.

    1. eLife Assessment

      This important study combines single-molecule imaging and CUT&Tag to address the molecular mechanism underlying the differentiation process that initiates the formation of red blood cells in the bone marrow. The authors provide evidence that the transcription factor GATA2 transiently associates with a new set of genomic loci early in the differentiation process before it is replaced by GATA1. Together, the experiments across three biological systems are solid, but they could benefit from additional details and controls to strengthen the conclusions.

    2. Reviewer #1 (Public review):

      Summary:

      During erythroid differentiation, hematopoietic progenitors relinquish multipotency and activate lineage programs. The switch from GATA2 to GATA1 is particularly important in this process, yet GATA2 chromatin‑binding kinetics remain undefined. The authors investigated GATA2-chromatin interaction dynamics during erythroid differentiation in three different cell systems using single‑molecule live‑cell imaging, and they also used CUT&Tag to profile GATA2 chromatin occupancy.

      By single‑molecule imaging, the authors report two interaction modes for GATA2: short‑lived (<1 s) and long‑lived (>5 s) binding. The proportion of long‑lived molecules, the number of binding events, and the duration of long‑lived binding change (or are maintained) during differentiation. Notably, long‑lived chromatin engagement by GATA2 increases during early erythroid differentiation and decreases at the late stage. CUT&Tag identifies regulatory elements selectively occupied by GATA2 during the early transition stage. Together, these results support a model in which transcription factor kinetics form a dynamic chromatin‑engagement profile that characterizes the GATA2‑to‑GATA1 transition.

      Strengths:

      (1) Characterizing transcription‑factor binding kinetics during the GATA2‑to‑GATA1 transition addresses a fundamental mechanism in erythroid differentiation.

      (2) Combining single‑molecule live imaging with CUT&Tag provides both dynamic and locus‑specific perspectives.

      (3) Single-molecule analysis across three different cell systems strengthens the potential generalizability of the findings and highlights biological variability.

      Weaknesses:

      I agree that single‑molecule imaging is a powerful approach for investigating GATA2 kinetics, but the single‑molecule data are the most important part of the paper and need improvement. The analyses focus on three measures: (i) duration of long binding, (ii) proportion of short‑ and long‑binding molecules, and (iii) total binding events. However, several methodological and control issues limit confidence in the kinetic interpretations. The authors should address the following major concerns.

      (1) Two binding states: justification and controls

      The authors propose two states of GATA2 binding. Are there only two states? Studies that separate short‑ and long‑lived binding (e.g., Chen et al., 2014, PMID: 25342811) address the two states of transcription factors very carefully. Some long‑binding duration distributions here are very long‑tailed (e.g., Figure 2D middle), suggesting a possible third state. The authors must explain how they determined that two states provide the "best fit" to the data and how they classified "short" versus "long" binding.

      Controls should be included for long‑lived and short‑lived binding (e.g., histone proteins, HaloTag‑NLS, or a binding‑deficient GATA2 mutant) as in other studies. These controls are essential to exclude alternative explanations (see points below).

      (2) Exclude photophysical and focal‑plane artifacts

      The authors should exclude contributions from (i) photobleaching, (ii) blinking, and (iii) Z‑axis motion (disappearance from the focal plane). Although photobleaching correction is mentioned in the Methods, no details are provided. Describe and quantify the photobleaching correction and demonstrate that it was applied across all cell types and conditions.

      Some spots in the supplementary movies appear to blink or to move substantially between frames. Provide analyses or controls that distinguish true dissociation events from photophysical blinking/bleaching or axial motion.

      (3) HILO illumination and nuclear region sampled

      HILO is powerful but sensitive to illumination angle: slight changes sample different nuclear regions (e.g., nuclear interior versus periphery). The nuclear periphery is enriched in heterochromatin and may bias binding statistics. Explain how the authors controlled the HILO angle and confirmed that comparable nuclear regions were imaged across cells and conditions.

      (4) Quantification of event counts and long‑binding durations

      The number of binding events and measured long‑binding durations are strongly affected by imaging conditions (labeling/staining, bleaching, nucleus size, cell cycle state, focal plane, spot detectability, etc.). Imaging clarity appears to differ among cells/conditions in the supplementary movie. Provide more careful analysis describing how these variables were controlled or corrected for, and assess the sensitivity of results to choices in detection and tracking parameters.

      (5) Evidence that spots are single molecules

      The authors state that spots represent single molecules but do not provide supporting evidence. Spot brightness varies considerably in the movies. Brightness differences may reflect axial position. Provide evidence supporting single‑molecule assignment (e.g., single‑step photobleaching traces, brightness distributions compared to a known single‑molecule control, or photon count analysis).

      (6) Description of spot‑analysis pipeline

      The manuscript lacks a sufficient description of the spot‑analysis method. I reviewed the STRAP pipeline paper cited (Haque and Coleman 2025 bioRxiv) and the GitHub code, but the Methods in the current manuscript should include a detailed STRAP pipeline. This would enable readers to evaluate and reproduce the analyses.

      (7) Differences among cell systems

      The three cell systems yield notably different results (e.g., Figure 2C vs 4C and Figure 2D/3D vs 4D). Provide a more detailed explanation for these differences and discuss how biological variability, technical differences, or imaging biases might account for the discrepancies.

    3. Reviewer #2 (Public review):

      In this study, the authors address the molecular mechanism underlying the transcriptional changes during erythroid differentiation from hematopoietic progenitor cells. The authors combine single-molecule live cell imaging and CUT&RUN to analyze the chromatin binding properties of the GATA2 transcription factor prior to and after initiation of differentiation into the erythroid cell lineage. Using three distinct cellular systems, the authors demonstrate that the chromatin binding of GATA2 is transiently increased early in the differentiation process, as evidenced by increased chromatin binding residence time and the emergence of new genomic binding sites identified by CUT&RUN. The strength of the study lies in the combination of single-molecule imaging, which reports on binding dynamics but is agnostic of the binding site, with CUT&RUN, which reports on the binding sites but does not provide dynamic information. The authors clearly demonstrate that chromatin binding of GATA2 is altered early in the differentiation process and is later displaced as cells switch to expression of GATA1, which has been previously observed. The use of three distinct cell lines, in particular the GATA2-SNAP mouse model, is a strength in principle; however, the results are not fully consistent between the different cell systems. A key difference is that the G1E-ER4 and HPC7 cell line models express HaloTagged GATA2 in addition to the endogenous GATA2 protein. The authors go to great lengths to control GATA2-HaloTag expression levels, but they use polyclonal cell lines and do not analyze expression levels of the GATA2-HaloTag transgene, which is a key variable in interpreting their experimental results. Finally, a key variable determined in their single-molecule analysis is the number of binding events observed during the distinct differentiation changes. The number of binding events observed is influenced by the expression level of the tagged protein, which in turn is controlled by the Shield-1 ligand, and the fraction of molecules labeled with the HaloTag ligand. Since transgene protein levels and the labeling efficiency were not determined, it is hard to assess how reliable the measurements of the number of binding events are across all cell lines.

      To address the weaknesses summarized above, the authors could take the following steps:

      (1) Determine the expression levels of the GATA2-HaloTag transgene over the course of differentiation under the conditions used for single-molecule imaging. This will not only allow them to determine the expression of the transgene but also the endogenous untagged protein with which the GATA2-HaloTag fusion proteins compete for binding sites.

      (2) To determine the fraction of molecules labeled during imaging, the authors could carry out a titration of the HaloTag ligand and compare the amount of labeled protein under single-molecule imaging conditions to that of saturating labeling of the HaloTag. This approach will ensure that the number of labeled molecules per cell is comparable across experimental conditions and allow the authors to draw more solid conclusions regarding the number of binding events.

      (3) The analysis of residence times using single-molecule imaging requires robust single-particle tracking without gaps or interruptions of trajectories. The authors should show images of their particle trajectories to demonstrate that their tracking is robust. Or even better, movies superimposing the trajectories onto the imaging data.

    4. Reviewer #3 (Public review):

      Hobbs et al. use live-cell single-molecule tracking (SMT) of HaloTag- and SNAP-tagged GATA2 combined with CUT&Tag chromatin profiling to examine how GATA2 chromatin engagement evolves during erythroid differentiation. Across three complementary systems, G1E-ER4 cells, HPC7 cells, and primary bone marrow progenitors from a new Gata2-SNAP knock-in mouse, they report a transient strengthening of long-lived GATA2 chromatin binding at the "Early" (2 h) erythroid stage, manifested either as increased residence time (G1E-ER4) or expansion of the long-lived bound fraction (HPC7, primary cells). CUT&Tag identifies 1,167 Early-restricted GATA2 peaks partitioning into GATA2-only (promoter-proximal, GATA/RUNX motifs) and GATA2+GATA1 co-bound (distal, GATA/E-box motifs) subclasses. The authors propose that this kinetic phase represents a previously unappreciated dimension of the GATA switch.

      This is a strong study with a genuinely novel finding: the non-monotonic kinetic behavior of GATA2 during erythroid priming, supported by complementary measurements in three biological systems. The issues below are largely clarifications, additional analyses of existing data, and modest refinements to the discussion. With these addressed, the manuscript will make a valuable contribution. I recommend a minor revision.

      Specific points:

      (1) Clarify the photobleaching correction and report per-cell bleach lifetimes.

      The long-lived residence time claim in G1E-ER4 cells depends on careful accounting for photobleaching, which the Methods indicate was done via a right-censoring model. For reviewer and reader confidence, the authors should report the per-stage (or per-cell distribution of) photobleaching lifetimes and the photobleach-corrected residence time values alongside the apparent values in Figure 2D. If feasible, including a brief supplementary analysis with an H2B-Halo or similar long-lived control under matched conditions would further solidify the quantitative claims. This is an analysis of existing data and should not require new imaging.

      (2) Unify or explicitly discuss the mechanistic differences across systems.

      The three systems show qualitatively different signatures: residence time change in G1E-ER4, and bound fraction expansion in HPC7 and primary cells. The authors currently group these under "enhanced engagement," but these signatures imply different underlying mechanisms (koff decrease vs. increased kon or increased specific-binding-competent pool). The Discussion partially addresses this by noting engineered vs. native differences, but a more explicit framing in both Results and Discussion would help readers. Specifically, reporting an on-rate proxy (for example, binding events per unit time normalized to detectable molecule number) alongside koff would let readers see how the mechanistic pieces fit together. I do not think this changes the central message; it sharpens it.

      (3) Per-cell GATA2 concentration would strengthen the "uncoupling" claim.

      A central claim of the Figure 6 model is that chromatin engagement is uncoupled from protein abundance. The ectopic Shield-1 stabilization system is a reasonable design choice, but quantifying total nuclear GATA2-Halo signal (for example, from the pre-bleach frame or a brief high-power acquisition) on a per-cell basis across stages would directly support the interpretation. For the primary cells, where the biological claim is strongest, a western blot or quantitative immunofluorescence on the flow-sorted populations would make the uncoupling argument much more defensible. I recognize this may be one additional experiment, but it is a high-value one.

      (4) Additional single-cell distribution analysis.

      Figure 1E and Figures 2 to 4 show substantial cell-to-cell heterogeneity, and the Early populations in particular look potentially bimodal. Given that the authors cite Wheat et al. and Palii et al. on probabilistic hematopoietic transitions, a brief supplementary analysis using distribution-based statistics (K-S test, or mixture model) rather than, or alongside, mean-based ANOVA would align the analysis with this conceptual framing and may reveal whether the Early state represents a subpopulation transition rather than a uniform shift. This is purely an analysis of existing data.
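      As a rough illustration of the distribution-based comparison suggested here, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical cumulative distributions) in plain Python. The function name and the per-cell values are hypothetical and are not taken from the manuscript; in practice a statistics library such as SciPy (`scipy.stats.ks_2samp`) would also supply the p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d_max = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max


# Hypothetical per-cell residence times in seconds, for illustration only.
basal = [5.1, 5.4, 5.8, 6.0, 6.3]
early = [5.2, 5.5, 7.9, 8.4, 8.8]
print(ks_statistic(basal, early))
```

      Unlike a comparison of means, this statistic responds to a shift in only part of the population, which is the scenario the reviewer raises for the Early stage.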

      (5) Quantitative integration of CUT&Tag with SMT.

      The manuscript presents SMT and CUT&Tag as complementary but does not attempt to quantitatively connect them. A back-of-the-envelope calculation of whether a 21% increase in residence time (G1E-ER4), or the fraction expansion in other systems, is consistent with the acquisition of the 1,167 Early-restricted sites, given plausible site affinities, would substantially strengthen integration. Even if the calculation is approximate, framing it explicitly would help readers appreciate that the two datasets reinforce each other.

      (6) Short-lived kinetic interpretation and tracking parameters.

      The 1.5 s gap allowance in tracking is long relative to the 0.55 to 0.73 s short-lived residence times reported in primary cells (Figure Supplement 1F), which could affect the interpretation of the "slowing of target search" claim. A brief sensitivity analysis with tighter gap parameters in the supplement would reassure readers that this effect is robust. Additionally, please clarify how the inferred slowing of search, which should reduce kon, is reconciled with the increased number of binding events per cell at the Early stage.
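      To make the concern concrete, the following minimal sketch shows how the gap allowance alone can change measured dwell times. It is not the STRAP implementation: the function name, the frame list, and the 0.5 s frame interval are assumptions chosen for illustration (at a 0.5 s interval, a 1.5 s allowance would correspond to three missing frames).

```python
def dwell_times(detected_frames, frame_interval_s, max_gap_frames):
    """Split detections at one site into binding events and return durations.

    Two detections are linked into the same event when at most
    `max_gap_frames` frames are missing between them.
    """
    frames = sorted(set(detected_frames))
    if not frames:
        return []
    durations = []
    start = prev = frames[0]
    for frame in frames[1:]:
        missing = frame - prev - 1
        if missing > max_gap_frames:
            # Gap too long: close the current event and start a new one.
            durations.append((prev - start + 1) * frame_interval_s)
            start = frame
        prev = frame
    durations.append((prev - start + 1) * frame_interval_s)
    return durations


# Hypothetical detections of a spot at one location (frame indices).
frames = [0, 1, 3, 4, 8, 9]
for gap in (0, 1, 3):
    print(gap, dwell_times(frames, 0.5, gap))
```

      With these made-up detections, a zero-frame allowance yields three 1.0 s events, whereas a three-frame allowance merges everything into a single 5.0 s event, so short-lived residence times are especially sensitive to this parameter.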

      (7) CUT&Tag peak definition.

      The Early-restricted peak set is defined by presence and absence at q less than 0.01, which can be sensitive to near-threshold peaks. Please report either (a) the CUT&Tag signal intensity distribution at the 1,167 sites across all three stages as a quantitative scatter or density plot, beyond the heatmap in Figure 5C, or (b) the result of a differential binding analysis (for example, DESeq2 on read counts in a union peak set) as a supplementary confirmation. Please also state the number of CUT&Tag replicates per stage and the overlap of Early-restricted sets across replicates.

      (8) Knock-in mouse validation.

      The Gata2-SNAP allele is a valuable new tool, and it would benefit from slightly more quantitative validation in the supplement. A brief characterization of basic hematopoietic parameters in homozygotes (CBC, LSK/HSPC frequencies, or colony assays) would confirm that the tagged allele is truly physiological and would serve the community that will want to use this mouse going forward. If this has been done, please include it; if not, a statement about what was checked would suffice.

    5. Author response:

      We are writing to provide our provisional response to the public reviews. We note that the reviewers’ comments focus primarily on strengthening technical rigor and quantitative interpretation. We have designed the planned revisions to directly address the reviewers’ major concerns and to strengthen the study’s evidentiary basis. We plan to submit a revised manuscript for the final Version of Record.

      For clarity, we summarize below the major new experiments and analyses that address the reviewers’ primary concerns:

      (1) Validation of Tracking Parameters (Reviewers 1 & 3): We will re-analyze our single-molecule tracking data with tighter gap-time allowances (0 seconds) to demonstrate the robustness of our interpretations of short- and long-lived kinetics. We will also generate a supplementary movie with binding trajectories superimposed directly on detected molecules to visually confirm tracking robustness.

      (2) Photobleaching & Two-State Controls (Reviewers 1 & 3): We will report per-cell photobleaching lifetimes derived from our global fluorescence decay. To strengthen this analysis, we will include supplementary measurements using a H2B-HaloTag control under matched imaging conditions and perform single-molecule tracking of GATA2 zinc-finger deletion mutants (N-terminal, C-terminal, and double) as a binding-deficient functional control.

      (3) Protein Expression & Labeling Efficiency (Reviewers 1 & 2): To address concerns about transgene expression and competition with endogenous proteins, we will quantify Halo-GATA2 levels in G1E-ER4 and HPC7 cells and SNAP-GATA2 levels in primary cells using standardized titration methods with established Halo-CTCF and SNAP-RPB1 reference systems.

      (4) Integration of SMT and CUT&Tag (Reviewer 3): We have conducted a quantitative fold-change analysis of our existing CUT&Tag dataset to complement our single-molecule kinetics.

      However, as detailed in our specific response below (R3 point 5), we emphasize that directly integrating population-level genomic occupancy measurements with single-cell kinetic measurements is not straightforward. We will therefore frame the relationship between these datasets as a conceptual consistency check rather than a strict quantitative integration. This quantitative analysis supports and refines the Early-restricted peak set, identifying a high-confidence strict subset consistent with the broader presence/absence-defined set described in Figure 5 of the manuscript (see Author response images 1–3 and our response to R3 point 7).

      (5) Characterization of the GATA2-SNAP Mouse (Reviewer 3): We have characterized hematopoietic populations in the homozygous knock-in mouse, including lymphoid (CD3+/CD4+/CD8+/B220+/CD19+), myeloid (CD11b+/Gr1+), and erythroid (Ter119+) compartments. These data, presented in Author response image 4, indicate that normal mature hematopoietic output is preserved across genotypes. Statistical caveats are described in the corresponding figure legend and in our response to R3 point 8.

      Public Reviews:

      Reviewer 1 (Public review):

      (1) Two binding states: justification and controls

      The authors propose two states of GATA2 binding. Are there only two states? Some long-binding duration distributions here are very long-tailed (e.g., Figure 2D middle), suggesting a possible third state. The authors must explain how they determined that two states provide the best fit and how they classified short versus long binding. Controls should be included for long-lived and short-lived binding (e.g., histone proteins, HaloTag-NLS, or a binding-deficient GATA2 mutant).

      Agreed in part; we will attempt the requested binding-deficient control using existing GATA2 deletion constructs, complemented by GRID and H2B-HaloTag controls.

      We will clarify that the two-state framework is an operational model rather than a claim that GATA2 can occupy only two physical states. This approach is widely used in SMT studies of chromatin-associated transcription factors and transcription machinery (Gebhardt et al., 2013; Liu et al., 2014; Hansen et al., 2017; Kenworthy et al., 2022). In particular, Ling et al. (Science, 2026) recently used two-exponential survival-probability fitting across 58 Halo-tagged transcription-associated proteins to distinguish transient and stable chromatin-binding populations, while explicitly noting that the simplified two-state model provides a tractable framework even when the underlying physical behavior may be more heterogeneous.

      We agree that our current two-state model may under-represent the diversity of GATA2 chromatin-binding populations in single cells. However, even within this simplified framework, the existing analysis already indicates increased upper-tail dispersion of kinetic measurements (e.g., residence time and/or percentage of stable events) at the single-cell level in early erythroid cells. To support the goodness-of-fit metrics from our two-state fitting, as Reviewer 3 recommends, we will provide a supplementary table containing confidence intervals for the rate parameters and an F-test metric describing the differences between one- and two-state fits.
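      For readers unfamiliar with this kind of model comparison, the sketch below writes out a two-exponential survival function and the standard extra-sum-of-squares F statistic for nested least-squares fits. It is a generic illustration rather than the authors' fitting code; the function names, parameter counts (one rate for the one-state fit; one fraction plus two rates for the two-state fit), and residual values are assumptions.

```python
import math


def two_state_survival(t, long_fraction, k_short, k_long):
    """Survival probability for a mixture of two exponential dwell-time populations."""
    short_term = (1.0 - long_fraction) * math.exp(-k_short * t)
    long_term = long_fraction * math.exp(-k_long * t)
    return short_term + long_term


def f_statistic(rss_simple, rss_complex, n_points, p_simple, p_complex):
    """Extra-sum-of-squares F statistic comparing two nested fits.

    rss_* are residual sums of squares, p_* the numbers of fitted
    parameters, and n_points the number of fitted data points.
    """
    if p_complex <= p_simple:
        raise ValueError("the complex model must have more parameters")
    dof_complex = n_points - p_complex
    if dof_complex <= 0 or rss_complex <= 0:
        raise ValueError("not enough data or degenerate fit")
    numerator = (rss_simple - rss_complex) / (p_complex - p_simple)
    denominator = rss_complex / dof_complex
    return numerator / denominator


# Hypothetical residuals from fitting 25 points of a survival curve.
print(f_statistic(rss_simple=10.0, rss_complex=5.0, n_points=25,
                  p_simple=1, p_complex=3))
```

      The resulting value would be compared against an F distribution with (p_complex - p_simple, n_points - p_complex) degrees of freedom; a large value indicates that the extra state reduces the residuals more than expected by chance.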

      To determine whether additional binding states exist, we will perform GRID (Genuine Rate Identification from Distributions), which does not bias the model toward a particular number of states and, in our experience across multiple proteins, yields fits with 3-5 binding populations. However, we have found that in many cases, GRID requires aggregating binding events from multiple cells to achieve consistently robust fits for the populations of relatively rare, long-lived (>~30 sec) binding events. Therefore, GRID will assess whether additional populations exist, but we will lose the ability to analyze changes in the cell populations at the single-cell level.

      We will include the multi-state analysis as a new supplementary figure. We will additionally clarify in the Results and Methods exactly how short- and long-lived binding events are classified (1-second threshold consistent with prior single-molecule frameworks for transcription-factor chromatin interactions; Gebhardt et al., 2013; Liu et al., 2014; Kenworthy et al., 2022) and direct the reviewer to these passages.

      For the requested controls, we will include H2B-HaloTag imaging under matched conditions as a long-lived reference for both photobleaching correction and as a positive control for stable chromatin association, addressing R1 point 2 and R3 point 1 simultaneously.

      We will also attempt to address the reviewer’s request for a binding-deficient control. We have lentiviral constructs in hand that encode GATA2 with a C-terminal zinc-finger deletion (which removes the primary DNA-binding domain), an N-terminal zinc-finger deletion, and a double deletion. We will perform single-molecule tracking of these mutants in the engineered cell systems and test whether removing GATA2’s specific DNA-binding capacity produces the predicted reduction in long-lived chromatin engagement, providing a functional perturbation control. The interpretation of these experiments will depend on the mutants expressing and localizing appropriately, which we will validate before drawing kinetic conclusions. We note that an analogous binding-deficient mutant cannot be examined in the physiological context of the Gata2-SNAP knock-in mouse, and we will frame the cell-line mutant analyses accordingly. Together with GRID and the H2B-HaloTag control, these mutants provide complementary lines of validation for the two-state kinetic framework.

      (2) Photophysical and focal-plane artifacts

      The authors should exclude contributions from (i) photobleaching, (ii) blinking, and (iii) Z-axis motion. Describe and quantify the photobleaching correction. Provide analyses or controls that distinguish true dissociation events from photophysical blinking/bleaching or axial motion.

      Agreed.

      We will substantially expand the methodological description and provide three new pieces of supplementary analysis:

      - Photobleaching: A per-cell photobleaching-rate distribution will be plotted for each cell type and differentiation stage, and photobleach-corrected residence-time values will be reported alongside apparent values in the relevant figures. We will also perform H2B-HaloTag imaging under matched illumination, exposure, and dye conditions in each cell line as a long-lived chromatin-bound reference, establishing per-cell-type bleach lifetimes to which the GATA2 measurements can be referenced. This approach follows recent SMT precedent in which H2B decay was used to correct residence-time measurements for photobleaching, chromatin and nuclear motion, microscope drift, defocalization, and dye photophysics (Ling et al., Science 2026). The right-censoring photobleach-correction model used in our analysis will be described in detail in the revised Methods, including parameter values and per-cell handling.

      - Blinking: The STRAP single-particle tracking pipeline already accommodates fluorophore blinking when linking trajectories across successive frames, following the multiple-target-tracing framework of Sergé et al. (Nature Methods, 2008). This use of short gap-frame allowances to avoid artificially splitting trajectories due to fluorophore blinking or transient defocalization is consistent with recent live-cell SMT studies of chromatin-associated factors (Ling et al., Science 2026). We will add an explicit statement to the Methods describing how blinking-tolerant linkage parameters are set, and we will reanalyze representative datasets with stricter maximum off-frame settings to ensure this parameter does not drive our conclusions (also addressing R3 point 6).

      - Z-axis motion: Given our 500-ms exposure and the ~500-nm axial detection range of the HiLo configuration, axial loss is expected to be a minor contributor. We will quantify this indirectly by plotting, as a supplementary analysis, the maximum in-plane 2D spatial exploration of each binding trajectory, defined as the long-axis diameter of the 2D trajectory envelope. Although this does not directly measure z-position, it serves as a control for large apparent displacements that could reflect molecules moving out of the HiLo detection volume and demonstrates that observed dissociation events are not dominated by axial drift.
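      One plausible reading of the "long-axis diameter of the 2D trajectory envelope" is the largest distance between any two localizations in a trajectory. The sketch below computes that quantity; the interpretation, the function name, and the coordinates are assumptions made for illustration and may differ from the definition used in the revised analysis.

```python
from itertools import combinations
from math import hypot


def trajectory_extent(points):
    """Largest distance between any two (x, y) localizations in a trajectory."""
    if len(points) < 2:
        return 0.0
    return max(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
    )


# Hypothetical localizations in nanometres for one bound molecule.
bound_track = [(0.0, 0.0), (30.0, 40.0), (10.0, 10.0)]
print(trajectory_extent(bound_track))
```

      Trajectories whose extent stays small relative to the localization precision are consistent with stable binding, whereas large values flag events that may reflect motion rather than dissociation.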

      Representative photobleaching traces from individual cells (lowest, highest, and median bleach rates) will be included to support the single-molecule interpretation (also addresses R1 point 5).

      (3) HILO illumination and nuclear region sampled

      HiLo is sensitive to illumination angle: slight changes sample different nuclear regions. Explain how the HiLo angle was controlled and confirmed comparable across cells and conditions.

      Agreed.

      We will add a Methods subsection describing our HiLo illumination procedure. In brief, we started at a TIRF-supercritical angle and reduced it toward epifluorescence just enough to achieve high imaging depth while minimizing out-of-focus background signal. Within each biological system (cell line or primary cells), the TIRF angle was held constant across Basal, Early, and Late conditions to ensure direct comparability of kinetic measurements across stages.

      (4) Quantification of event counts and long-binding durations

      The number of binding events and the duration of long-binding events are influenced by imaging conditions. Provide a more detailed analysis of how these variables were controlled and assess the sensitivity of the results to detection and tracking parameters.

      Agreed.

      We will (i) normalize per-cell binding-event counts to nuclear cross-sectional area (extracted from the segmented nuclear masks already in the STRAP pipeline) to control for differences in nuclear size; (ii) report the tracking-parameter sensitivity sweep described above; and (iii) confirm in the revised Methods that all imaging conditions (laser power, exposure, dye concentration, sample preparation) were held constant across stages and cell types, consistent with the existing manuscript text. Per the Reviewing Editor’s guidance, the planned labeling-efficiency and absolute-molecule-quantification experiments will further constrain the interpretation of binding-event counts across conditions.

      (5) Evidence that spots are single molecules

      Provide evidence that spots represent single molecules.

      Agreed.

      We will include a small number of per-event intensity traces from our STRAP tracking output, selected to illustrate the single-step photobleaching behavior characteristic of single-molecule emission (intensity remains approximately constant during the binding event and then drops to background in a single step). The nuclear-fluorescence measurements from the planned labeling-titration experiment will also allow us to confirm that bound-spot densities are consistent with single-molecule occupancy at the labeled fraction used for tracking.

      (6) Description of the spot-analysis pipeline

      The Methods should include a detailed STRAP pipeline description.

      Partially agreed; the existing STRAP reference is appropriate, but the Methods will be expanded.

      STRAP (Haque & Coleman, 2025) is a consolidated, automated implementation of two well-established, previously published frameworks: SLIMfast/multiple-target tracing (Sergé et al., 2008) and evalSPT (Normanno et al., 2015), both of which are cited in the original manuscript. We will expand the Methods to describe the parameter set used in our analysis (detection thresholds, linking radii, gap-frame allowance, photobleaching correction model) so that readers can assess the analysis without referring exclusively to the STRAP manuscript and code repository, while preserving the cited STRAP reference for the full algorithmic description. We respectfully suggest that a complete pipeline description duplicating Haque & Coleman (2025) would not be appropriate in a primary research article.

      (7) Differences among cell systems

      The three cell systems yield notably different results. Provide a more detailed explanation for these differences.

      Agreed.

      We will also explicitly describe the caveats of the engineered systems versus the native GATA2-SNAP primary-cell system, in which endogenous GATA2-SNAP remains under physiological regulation. Specifically, we will discuss how variables such as the GATA1-null background, ectopic forced nuclear import of GATA1-ERT, and ectopic GATA2-Halo in G1E-ER4 cells, as well as ectopic GATA2-Halo, endogenous GATA1, and cytokine signaling in HPC7 cells, likely contribute to the observed differences in signatures.

      Reviewer 2 (Public review):

      (1) Expression levels of the GATA2-HaloTag transgene

      Determine the expression levels of the GATA2-HaloTag transgene over the course of differentiation under the conditions used for single-molecule imaging.

      Agreed.

      This is the central concern flagged by the Reviewing Editor. For each cell line (G1E-ER4 and HPC7), we will (i) measure total nuclear GATA2-Halo fluorescence per cell under matched acquisition conditions and (ii) convert this fluorescence intensity to absolute molecules per cell using a Halo-CTCF/U2OS reference standard (Cattoglio et al., 2019; absolute CTCF abundance quantification applied previously by our group). This will provide per-cell GATA2-Halo molecule counts at each differentiation stage (Basal, Early, Late). For the primary GATA2-SNAP cells, we will perform the analogous comparison against a SNAP-RPB1/U2OS standard.

      (2) Fraction of molecules labeled

      Carry out a titration of the HaloTag ligand and compare the amount of labeled protein under single-molecule imaging conditions to that of saturating labeling.

      Agreed.

      We will perform HaloTag-ligand and SNAP-tag-ligand titrations in each cell type, comparing nuclear fluorescence under the limiting-label conditions used for single-molecule tracking with that under saturating labeling. This will yield a per-cell-type labeled fraction and allow us to confirm that comparisons of binding-event counts across conditions are not confounded by differences in labeling efficiency. The labeled-fraction values will be reported in a new supplementary figure and incorporated into our quantification of binding-event rates.
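      The labeled fraction described above reduces to a ratio of background-subtracted fluorescence values. The sketch below is only a schematic of that arithmetic: it assumes fluorescence scales linearly with the number of labeled molecules and that both measurements use identical acquisition settings, and the function name and numbers are hypothetical.

```python
def labeled_fraction(limiting_signal, saturating_signal, background=0.0):
    """Fraction of tagged molecules carrying dye under limiting-label conditions."""
    limiting = limiting_signal - background
    saturating = saturating_signal - background
    if saturating <= 0:
        raise ValueError("saturating signal must exceed background")
    if limiting < 0:
        raise ValueError("limiting signal is below background")
    return limiting / saturating


# Hypothetical mean nuclear intensities (arbitrary units).
print(labeled_fraction(limiting_signal=120.0, saturating_signal=10020.0,
                       background=20.0))
```

      Reporting this value per cell type makes it possible to check whether differences in binding-event counts track differences in labeling rather than in binding.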

      (3) Robust single-particle tracking

      Show images of particle trajectories or movies superimposing trajectories on imaging data.

      Agreed.

      We will generate visualizations of selected long-lived binding events with single-particle trajectories overlaid on the imaging data — using a multi-frame color overlay (e.g., five sequential frames in distinct colors superimposed) so that linkage of the spot across frames is visually unambiguous — and include them as a new supplementary figure or movie. Examples will be drawn from each cell system to demonstrate consistent tracking quality.

      Reviewer 3 (Public review):

      (1) Photobleaching correction; per-cell bleach lifetimes

      Report the per-stage (or per-cell) photobleaching lifetimes and the photobleach-corrected residence time values alongside apparent values, ideally with an H2B-Halo control.

      Agreed.

      Addressed by the photobleach-rate distribution and H2B-HaloTag control analyses described under R1 point 2. The supplementary figure will explicitly compare per-cell bleach lifetimes across stages, report photobleach-corrected residence-time values alongside apparent values and include H2B-HaloTag controls under matched conditions in each cell line.

      (2) Mechanistic differences across systems

      The three systems show qualitatively different signatures: residence time change in G1E-ER4, bound fraction expansion in HPC7 and primary cells. Reporting an on-rate proxy alongside k_off would help.

      Agreed.

      Addressed by the cross-system kinetic framing described under R1 point 7 and by the GRID state-spectrum analysis described under R1 point 1. We will explicitly frame the three systems in terms of underlying kinetic mechanism in both Results and Discussion, following the conceptual distinction emphasized by Ling et al. (Science 2026) in which residence time reports binding stability once engaged, whereas changes in bound fraction or event frequency can indicate altered association/recruitment efficiency. In this framework, the G1E-ER4 residence-time signature is consistent with reduced dissociation (a longer-lived bound state), while the long-lived-fraction expansion in HPC7 and primary cells is consistent with an increased target-search efficiency or specific-binding-competent pool. Alongside the GRID-derived state-spectrum analysis, we will report an apparent engagement-rate proxy calculated as binding events per unit imaging time normalized to detectable molecule number; this proxy is an approximation, not a direct k_on measurement, as accurate determination of k_on from single-molecule tracking requires concentration-dependent on-rate experiments that are outside the scope of the present study. We thank the reviewer for this suggestion, which we agree sharpens rather than alters the central message.
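      The engagement-rate proxy defined above (binding events per unit imaging time, normalized to detectable molecule number) can be sketched as a single ratio. The function name, argument names, and numbers below are hypothetical, and the result is a proxy in the sense stated in the response, not a measurement of k_on.

```python
def engagement_rate_proxy(n_binding_events, imaging_time_s, n_detectable_molecules):
    """Binding events per second per detectable molecule (apparent proxy only)."""
    if imaging_time_s <= 0 or n_detectable_molecules <= 0:
        raise ValueError("imaging time and molecule number must be positive")
    return n_binding_events / (imaging_time_s * n_detectable_molecules)


# Hypothetical cell: 30 events in a 60 s movie with 10 detectable molecules.
print(engagement_rate_proxy(30, 60.0, 10))
```

      Because the denominator uses detectable rather than total molecules, the proxy inherits any uncertainty in the labeled fraction, which is one reason it is framed as an approximation.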

      (3) Per-cell GATA2 concentration and the uncoupling claim

      Quantify total nuclear GATA2-Halo signal per cell across stages; for primary cells, a western blot or quantitative immunofluorescence on flow-sorted populations would make the uncoupling argument more defensible.

      Agreed.

      For the cell lines, the per-cell nuclear GATA2-Halo quantification described in our response to R2 point 1 addresses this point.

      For primary cells, where the biological claim is strongest, we will exploit the endogenous Gata2-SNAP knock-in itself as a quantitative reporter of total GATA2 protein. Specifically, we will label flow-sorted CD71/Ter119 populations from Gata2-SNAP mouse bone marrow with SNAP-Cell 647-SiR at saturating concentration in a parallel acquisition to the limiting-label single-molecule tracking experiment. Total nuclear SNAP-GATA2 fluorescence at saturating labeling provides a measure of endogenous GATA2 abundance per cell at each erythroid stage, in the same chemistry used for our single-molecule measurements, and will be benchmarked against a SNAP-RPB1/U2OS reference standard for absolute molecule counting. This approach (i) measures the protein of interest in the labeling chemistry already established in this study; (ii) avoids reliance on quantitative immunofluorescence, which we have not been able to validate under our flow-sorted-cell conditions; and (iii) extends the same analytical framework (saturating versus limiting labeling, with U2OS reference standards) across cell lines and primary cells. Quantitative western blotting on flow-sorted populations remains an alternative we will consider if specifically requested by the reviewers.

      (4) Single-cell distribution analysis

      Distribution-based statistics (K-S test, mixture model) rather than (or alongside) mean-based ANOVA, particularly for the Early populations, which look potentially bimodal.

      Agreed.

      We will perform Kolmogorov–Smirnov and Gaussian mixture model analyses of the single-cell long-lived fraction and residence-time distributions across stages, reporting these alongside the existing Welch ANOVA results in a new supplementary figure. This analysis is consistent with the conceptual framework cited in the manuscript (Wheat et al., 2020; Palii et al., 2019) for probabilistic hematopoietic transitions and may reveal subpopulation structure underlying the Early-stage signal. The GRID analysis further complements this by formally testing whether multi-state mixture models are statistically preferred at each stage. However, GRID analysis requires aggregating binding events across cells, which limits our ability to monitor changes in population dispersion at the single-cell level.
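      To make the distribution-based comparison concrete, the sketch below computes the two-sample Kolmogorov–Smirnov statistic: the largest vertical gap between two empirical cumulative distribution functions. It covers only the K-S statistic, not the p-value or the Gaussian mixture model. It uses the standard library so that the logic is visible; an established routine such as `scipy.stats.ks_2samp` would normally be preferred. The per-cell values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic D = max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDF difference only changes at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Hypothetical per-cell long-lived fractions at two stages.
basal = [0.10, 0.12, 0.11, 0.13, 0.09]
early = [0.11, 0.25, 0.27, 0.12, 0.26]
print(ks_statistic(basal, early))
```

      Unlike a comparison of means, this statistic responds to a shift in only part of the population, which is the situation a bimodal Early-stage distribution would produce.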

      (5) Quantitative integration of CUT&Tag with SMT

      Attempt a back-of-the-envelope calculation of whether the residence-time or fraction changes are quantitatively consistent with the acquisition of the 1,167 Early-restricted sites.

      Partially agreed; will attempt an order-of-magnitude framing.

      We thank the reviewer for this thoughtful suggestion. We agree that more explicit framing of the quantitative relationship between the two datasets will strengthen the integration. We will add a paragraph to the Discussion presenting an order-of-magnitude calculation linking the observed residence-time and long-lived-fraction changes to the steady-state occupancy increase predicted at competent regulatory sites, with explicit caveats regarding (i) the inherently semi-quantitative nature of CUT&Tag signal and (ii) the assumptions required to translate population-averaged occupancy into the genome-wide site count observed. For the G1E-ER4 cells, we observe relatively minor shifts in population-mean behavior as single-cell dispersion increases. Therefore, it may be difficult to directly link population-based measurements (e.g. CUT&Tag) with single-cell kinetic measurements (SPT). This distinction between occupancy and dynamics is consistent with recent systematic SMT analysis of the eukaryotic transcription machinery, in which factors appearing persistently associated in ensemble genomic assays were shown to exchange on second-scale timescales in living cells (Ling et al., Science 2026), emphasizing that population genomic occupancy and single-molecule residence time are complementary but not directly interchangeable measurements. Closing this gap rigorously is a major hurdle for the field and will require substantial technology development on quantitative single-cell CUT&Tag occupancy measurements. We will therefore frame our analysis as a consistency check rather than a strict quantitative integration. The reviewer notes that this analysis “does not change the central message; it sharpens it,” and we agree.

      (6) Short-lived kinetic interpretation and tracking parameters

      The 1.5 s gap allowance is long relative to the short-lived residence times in primary cells. A sensitivity analysis with tighter gap parameters would help. Also clarify how slowing of search reconciles with increased binding events at Early.

      Agreed.

      Addressed by the tracking-parameter sensitivity analysis described under R1 point 2. We apologize for the lack of clarity in our original description of the gap allowance. Our current maximum off-frame parameter is set to 2 frames, corresponding to a 0.5-s gap allowance. We will rerun the tracking analysis on representative datasets using a maximum off-frame parameter of 1, corresponding to no missed frames, and will report the resulting residence-time distributions alongside the original analysis to demonstrate robustness. We will also clarify in the Results and Discussion how changes in short-lived binding kinetics are reconciled with the increase in detectable binding events at the Early stage, drawing on the apparent engagement-rate proxy interpreted alongside the GRID-derived state-spectrum analysis.

      (7) CUT&Tag peak definition and quantitative analysis

      Report (a) signal intensity distribution at the 1,167 sites across stages (scatter or density plot beyond the heatmap) or (b) differential binding analysis (e.g., DESeq2). State replicate count and overlap of Early-restricted sets across replicates.

      Agreed; normalized fold-change analysis completed, with replicate-aware differential binding analysis planned if additional replicates are generated.

      We have performed a normalized count-based fold-change analysis of the union peak set from the existing GATA2 CUT&Tag dataset (14,468 peaks) using the goodpeaks framework previously used in our group, yielding per-peak log2 fold-change values and discrete dynamic-status calls (Gained / Lost / Unchanged at |log2FC| ≥ 2) for each of the two transitions (Basal → Early at 0 vs 2 h, and Early → Late at 2 vs 24 h). This provides a conservative quantitative complement to the presence/absence peak-calling analysis presented in Figure 5; if additional replicate data are generated, we will perform replicate-aware differential binding analysis (DiffBind/DESeq2; Love et al., 2014; Stark & Brown, 2011) and report replicate overlap. This analysis addresses option (b) of the reviewer’s request and also enables the visualization requested in option (a) as a cross-stage scatter (Author response image 1). We present the quantitative analysis as a supplement to the presence/absence-defined Early-restricted set in Figure 5 of the manuscript, providing two orthogonal lines of evidence for the same biology. We note that the CUT&Tag experiments were initially performed as a validation step to confirm that the tagged GATA2-Halo constructs recapitulate endogenous chromatin-binding behavior, including appropriate genomic localization and expected GATA switch dynamics. This validation supports the conclusion that the observed single-molecule kinetics reflect physiologically relevant GATA2 engagement. Having established this, we subsequently extended the dataset to perform the quantitative analyses presented here.

      Quantitative findings.

      - 384 peaks were Gained (|log2FC| ≥ 2) at the Basal → Early transition.

      - 1,006 peaks were Lost over the same transition.

      - 178 peaks were Gained at Basal → Early and subsequently Lost at Early → Late, defining the strict differentially-restricted Early set (Author response image 1, red points). This set represents the higher-confidence subset of the manuscript’s broader presence/absence-defined Early-restricted set (n = 1,167; defined as MACS2 peaks at q < 0.01 present at Early but absent at Basal and Late).

      - 200 peaks were Gained at Early and retained at Late, indicating stable acquisition.

      - 49 peaks were acquired only at the Late stage.

      The discrepancy between the broader presence/absence set (1,167) and the strict differential set (178) reflects the analytical choice the reviewer raised: presence/absence calls based on a peak-significance threshold are sensitive to near-threshold peaks, whereas differential analysis with a fold-change cutoff captures only sites with quantitatively pronounced stage-restricted enrichment. We interpret these as two complementary definitions: the broader set captures all peaks meeting a stage-specific peak-calling criterion, and the strict subset isolates the most quantitatively dynamic core of that population.
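      The dynamic-status calls summarized above can be illustrated with a short classification sketch. It applies the stated |log2FC| ≥ 2 cutoff to each of the two transitions and combines the calls into the categories named in this response. It is not the goodpeaks implementation. The reading of "retained at Late" as Gained followed by Unchanged is an assumption, as are the function names and the example values.

```python
THRESHOLD = 2.0  # |log2FC| cutoff stated in the response


def call_status(log2fc, threshold=THRESHOLD):
    """Gained / Lost / Unchanged call for a single transition."""
    if log2fc >= threshold:
        return "Gained"
    if log2fc <= -threshold:
        return "Lost"
    return "Unchanged"


def classify_peak(basal_to_early, early_to_late, threshold=THRESHOLD):
    """Combine the two per-transition calls into one category."""
    first = call_status(basal_to_early, threshold)
    second = call_status(early_to_late, threshold)
    if first == "Gained" and second == "Lost":
        return "Early-restricted"
    if first == "Gained" and second == "Unchanged":
        # Assumed meaning of "retained at Late".
        return "Gained at Early, retained at Late"
    if first == "Lost":
        return "Lost at Early"
    if first == "Unchanged" and second == "Gained":
        return "Acquired only at Late"
    return "Unchanged/other"


# Hypothetical (Basal→Early, Early→Late) log2 fold changes.
for pair in [(2.5, -3.0), (2.0, 0.3), (0.1, 2.2), (-2.4, 0.0), (0.5, -0.5)]:
    print(pair, classify_peak(*pair))
```

      A peak called present only at Early by a significance threshold can still fall into "Unchanged/other" here if its fold change is below the cutoff, which is how the strict set ends up smaller than the presence/absence set.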

      Importantly, the three named example loci shown in Figure 5D of the manuscript (Nono, promoter-proximal; Nr3c1, intron 2; and Gata3, distal intergenic) all survive the strict differential criterion (each shows |log<sub>2</sub>FC| ≥ 2 in both transitions, consistent with a clean Gained-then-Lost signature). The published example panel therefore represents the high-confidence intersection of both definitions, supporting the robustness of the manuscript’s selected illustrative cases.

      We will explicitly state the number of CUT&Tag replicates per stage in the revised Methods and figure legends. Where the differential analysis is currently based on a single replicate per stage, we will explicitly note this and treat the strict subset as a conservative confirmatory analysis. An additional replicate is under consideration for the full revision, and if performed, overlap of Early-restricted calls across replicates will be reported.

      Motif cross-validation against a matched-GC background using HOMER and/or MEME-ChIP is planned for the strict differential subset and will be reported alongside the original SeqPos analysis in the revised Figure 5F or its supplement.

      Author response image 1.

      Cross-stage log<sub>2</sub> fold-change scatter for GATA2 CUT&Tag peaks. Each point represents a single peak in the union peak set (n = 14,468). The x-axis shows the log2 fold change from Basal (0 h) to Early (2 h); the y-axis shows the log2 fold change from Early (2 h) to Late (24 h). The sign convention follows the field-standard direction (positive log2FC = increased signal at the later time point). Peaks are colored by dynamic-status classification: unchanged/other (gray; n = 9,794); Lost at Early (blue; n = 109); Gained at Early and retained at Late (orange; n = 200); acquired only at Late (teal; n = 49); and Early-restricted, defined as Gained at Early and Lost at Late with |log2FC| ≥ 2 in both transitions (red; n = 178). The Early-restricted population occupies the lower-right quadrant, consistent with a transient kinetic peak of GATA2 binding.

      Author response image 2.

      Density representation of GATA2 CUT&Tag peak dynamics with Early-restricted peaks highlighted.

      Author response image 2 is shown for illustrative reference and is not annotated with a separate legend; it presents the same data as Author response image 1 in a hexbin density format to emphasize the bulk of unchanged peaks at the origin and the spatial separation of the Early-restricted set.

      Author response image 3.

      Genomic-annotation comparison of newly acquired GATA2 binding at Early. Stacked-bar comparison of genomic annotations (ChIPseeker classification) for two definitions of the newly acquired GATA2 peaks at the Early erythroid stage: all peaks Gained at Basal → Early (orange; n = 384) and the strict Early-restricted subset (Gained then Lost; red; n = 178). Annotation categories shown: Promoter (≤1 kb of TSS), Intron, Distal Intergenic, and Other (Exon, 5′/3′ UTR, Downstream). Both peak sets contain substantial promoter-proximal and distal/intronic components, consistent with the two-subclass model described in Figure 5E–G of the manuscript (GATA2-only promoter-proximal peaks with GATA/RUNX motifs, and GATA2/GATA1 cobound distal peaks with composite GATA/E-box motifs). The strict subset shows a higher proportion of intronic and distal-intergenic sites and a lower proportion of promoter-proximal sites than the full Gained set; this difference will be discussed transparently in the revised Results. Motif analysis (HOMER/MEME-ChIP, planned for the full revision) will be performed on both peak sets to confirm that the GATA/RUNX and GATA/E-box subclass signatures are preserved.

      (8) Knock-in mouse hematopoietic validation

      A brief characterization of basic hematopoietic parameters in homozygotes (CBC, LSK/HSPC frequencies, or colony assays) would confirm the tagged allele is physiological.

      Agreed; data acquired and analyzed.

      We have characterized mature trilineage hematopoietic populations in whole bone marrow from wild-type, heterozygous (Gata2Het), and homozygous (Gata2Homo) Gata2-SNAP knock-in mice (n = 5 per genotype). Bone marrow cells were stained for myeloid (CD11b<sup>+</sup> Gr1<sup>+</sup>), lymphoid (CD3<sup>+</sup>/CD4<sup>+</sup>/CD8<sup>+</sup>/B220<sup>+</sup>/CD19<sup>+</sup>), and erythroid (Ter119<sup>+</sup>) markers and analyzed by flow cytometry. Lineage frequencies are shown as percentages of live bone marrow cells in a new Figure Supplement in the revised manuscript.

      For myeloid and erythroid populations, omnibus one-way ANOVA detected no significant differences across genotypes (Myeloid: F(2,12) = 2.616, P = 0.1140; Erythroid: F(2,12) = 0.4943, P = 0.6219). Dunnett’s multiple-comparisons test against the WT control did not detect significant pairwise differences for either knock-in genotype (Myeloid: WT vs Het P = 0.1351, WT vs Homo P = 0.9926; Erythroid: WT vs Het P = 0.7017, WT vs Homo P = 0.9602).

      For the lymphoid compartment, although the omnibus ANOVA reached significance (F(2,12) = 6.690, P = 0.0112), no pairwise comparison against WT remained significant after multiple-comparisons correction (Dunnett’s adjusted P values: WT vs Het = 0.1217; WT vs Homo = 0.2078). We therefore interpret this result conservatively. Brown-Forsythe and Bartlett’s tests showed no significant differences in variance across genotypes (P = 0.1423 and P = 0.0908), so the result is not attributable to unequal variances. We do not interpret these data as indicating an unambiguous lymphoid phenotype in either heterozygous or homozygous Gata2-SNAP mice; this interpretation is consistent with the broader pattern across all three lineages, in which no pairwise comparison against WT survives multiple-comparisons correction. We will note in the figure legend and in the Results text that more granular HSPC-compartment analysis (LSK, MPP, lineage-restricted progenitor frequencies) and a complete blood count (CBC) remain valuable directions for future characterization of this resource and will be considered for the full revision if specifically requested.
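      The F(2,12) notation used above follows from the design: three genotypes give 3 − 1 = 2 between-group degrees of freedom, and 15 mice in three groups give 15 − 3 = 12 within-group degrees of freedom. The sketch below shows how a one-way ANOVA F statistic and these degrees of freedom are computed. The group values are toy numbers chosen for easy arithmetic, not data from the study, and the function name is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values only: three groups of five, mirroring n = 5 per genotype.
toy_groups = [
    [10, 11, 12, 13, 14],
    [11, 12, 13, 14, 15],
    [12, 13, 14, 15, 16],
]
f_stat, df_b, df_w = one_way_anova(toy_groups)
print(f"F({df_b},{df_w}) = {f_stat:.3f}")
```

      The omnibus F tests whether any group mean differs, whereas Dunnett’s test asks specifically about each knock-in genotype against WT. The two can therefore disagree, as in the lymphoid result.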

      Author response image 4.

      Bone marrow trilineage frequencies in Gata2-SNAP knock-in mice. Bone marrow was harvested from the femurs and tibias of wild-type (WT), heterozygous (Gata2Het), and homozygous (Gata2Homo) Gata2-SNAP knock-in mice (n = 5 per genotype; mixed sex; 12–14 weeks). After ACK lysis, cells were stained for myeloid (CD11b<sup>+</sup> Gr1<sup>+</sup>), lymphoid (CD3<sup>+</sup>/CD4<sup>+</sup>/CD8<sup>+</sup>/B220<sup>+</sup>/CD19<sup>+</sup>), and erythroid (Ter119<sup>+</sup>) markers and analyzed by flow cytometry. Each dot represents one mouse, and horizontal bars indicate genotype means. Statistical results: Myeloid: ANOVA F(2,12) = 2.616, P = 0.1140; Dunnett’s adjusted P values WT vs Het = 0.1351, WT vs Homo = 0.9926. Lymphoid: ANOVA F(2,12) = 6.690, P = 0.0112 (omnibus); Dunnett’s adjusted P values WT vs Het = 0.1217, WT vs Homo = 0.2078. Erythroid: ANOVA F(2,12) = 0.4943, P = 0.6219; Dunnett’s adjusted P values WT vs Het = 0.7017, WT vs Homo = 0.9602. Brown-Forsythe and Bartlett’s tests for unequal variance were non-significant in all three lineages. Although the lymphoid omnibus ANOVA reached nominal significance, no pairwise comparison with WT remained significant after multiple-comparison correction; we therefore interpret this result conservatively (see response to R3 point 8).

      Summary

      We thank the editors and the three reviewers for the constructive and detailed assessment. The planned revisions consist of:

      - Four new experiments [planned] (HaloTag/SNAP labeling efficiency and absolute molecule counts via U2OS reference standards; H2B-HaloTag photobleaching reference; per-cell quantification of total endogenous GATA2 in flow-sorted primary Gata2-SNAP populations via saturating SNAP-tag labeling, benchmarked against a SNAP-RPB1/U2OS reference standard; single-molecule tracking of GATA2 N-terminal, C-terminal, and double zinc-finger deletion mutants in the engineered cell systems as a binding-deficient functional control).

      - Six analyses of existing data (GRID multi-state fitting [planned]; per-cell bleach-rate distributions and photobleach-corrected residence times [planned]; tracking-parameter sensitivity [planned]; nuclear-area normalization and total-displacement controls [planned]; normalized fold-change CUT&Tag analysis [completed; motif cross-validation planned], presented in Author response images 1–3; distribution-based single-cell statistics [planned]).

      - One previously-acquired dataset [completed] (trilineage hematopoietic flow cytometry of homozygous Gata2-SNAP knock-in mice; presented in Author response image 4 with full statistical detail).

      - Substantial revisions to text and figures [planned] to address statistical reporting, methodological description, mechanistic framing of cross-system differences, and refinement of the Figure 6 schematic.

      With respect to the requested binding-deficient single-molecule control, we will attempt to address this directly using sequence-validated lentiviral constructs in hand encoding GATA2 mutants lacking the C-terminal zinc finger, the N-terminal zinc finger, or both. These mutant analyses will be complemented by GRID multi-state analysis and H2B-HaloTag controls, providing converging lines of validation for the two-state kinetic framework. We note that an analogous mutant cannot be examined in the physiological context of the Gata2-SNAP knock-in mouse, and we will frame the cell-line mutant analyses accordingly.

      We believe these revisions directly address the editors’ specific guidance regarding labeling efficiency and methodological clarification. We thank the editors and reviewers for their time and look forward to submitting the revised manuscript.

      References cited in this response:

      References listed below are cited in this provisional response in support of the planned analyses and methodology.

      Cattoglio, C., Pustova, I., Walther, N., Ho, J. J., Hantsche-Grininger, M., Inouye, C. J., Hossain, M. J., Dailey, G. M., Ellenberg, J., Darzacq, X., Tjian, R., & Hansen, A. S. (2019). Determining cellular CTCF and cohesin abundances to constrain 3D genome models. eLife, 8, e40164. https://doi.org/10.7554/eLife.40164

      Gebhardt, J. C. M., Suter, D. M., Roy, R., Zhao, Z. W., Chapman, A. R., Basu, S., Maniatis, T., & Xie, X. S. (2013). Single-molecule imaging of transcription factor binding to DNA in live mammalian cells. Nature Methods, 10(5), 421–426. https://doi.org/10.1038/nmeth.2411

      Hansen, A. S., Pustova, I., Cattoglio, C., Tjian, R., & Darzacq, X. (2017). CTCF and cohesin regulate chromatin loop stability with distinct dynamics. eLife, 6, e25776. https://doi.org/10.7554/eLife.25776

      Haque, N., & Coleman, R. A. (2025). Dynamic transcription pre-initiation complex assembly governs initiation efficiency. bioRxiv. https://doi.org/10.1101/2025.05.07.652662

      Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell, 38(4), 576–589. https://doi.org/10.1016/j.molcel.2010.05.004

      Kaya-Okur, H. S., Wu, S. J., Codomo, C. A., Pledger, E. S., Bryson, T. D., Henikoff, J. G., Ahmad, K., & Henikoff, S. (2019). CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nature Communications, 10(1), 1930. https://doi.org/10.1038/s41467-019-09982-5

      Kenworthy, C. A., Haque, N., Liou, S.-H., Chandris, P., Wong, V., Dziuba, P., Lavis, L. D., Liu, W.-L., Singer, R. H., & Coleman, R. A. (2022). Bromodomains regulate dynamic targeting of the PBAF chromatin-remodeling complex to chromatin hubs. Biophysical Journal, 121(9), 1738–1752. https://doi.org/10.1016/j.bpj.2022.03.027

      Ling, Y. H., Liang, C., Wang, S., & Wu, C. (2026). Live-cell single-molecule dynamics of eukaryotic RNA polymerase machineries. Science, 391, eads0960. https://doi.org/10.1126/science.ads0960

      Liu, Z., Legant, W. R., Chen, B.-C., Li, L., Grimm, J. B., Lavis, L. D., Betzig, E., & Tjian, R. (2014). 3D imaging of Sox2 enhancer clusters in embryonic stem cells. eLife, 3, e04236. https://doi.org/10.7554/eLife.04236

      Loeffler, D., Wang, W., Hopf, A., Hilsenbeck, O., Bourgine, P. E., Rudolf, F., Martin, I., & Schroeder, T. (2018). Mouse and human HSPC immobilization in liquid culture by CD43- or CD44-antibody coating. Blood, 131(13), 1425–1429. https://doi.org/10.1182/blood-2017-07-794131

      Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8

      Machanick, P., & Bailey, T. L. (2011). MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics, 27(12), 1696–1697. https://doi.org/10.1093/bioinformatics/btr189

      Normanno, D., Boudarène, L., Dugast-Darzacq, C., Chen, J., Richter, C., Proux, F., Bénichou, O., Voituriez, R., Darzacq, X., & Dahan, M. (2015). Probing the target search of DNA-binding proteins in mammalian cells using TetR as model searcher. Nature Communications, 6, 7357. https://doi.org/10.1038/ncomms8357

      Palii, C. G., Cheng, Q., Gillespie, M. A., Shannon, P., Mazurczyk, M., Napolitani, G., Price, N. D., Ranish, J. A., Morrissey, E., Higgs, D. R., & Brand, M. (2019). Single-cell proteomics reveal that quantitative changes in co-expressed lineage-specific transcription factors determine cell fate. Cell Stem Cell, 24(5), 812–825.e5. https://doi.org/10.1016/j.stem.2019.02.016

      Sergé, A., Bertaux, N., Rigneault, H., & Marguet, D. (2008). Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nature Methods, 5(8), 687–694. https://doi.org/10.1038/nmeth.1233

      Stark, R., & Brown, G. D. (2011). DiffBind: Differential binding analysis of ChIP-Seq peak data. Bioconductor. http://bioconductor.org/packages/release/bioc/html/DiffBind.html

      Taylor, S. J., Stauber, J., Bohorquez, O., Tatsumi, G., Kumari, R., Chakraborty, J., Bartholdy, B. A., Schwenger, E., Sundaravel, S., Farahat, A. A., Dutta, A., Koche, R. P., Steidl, U., & Wheat, J. C. (2024). Pharmacological restriction of genomic binding sites redirects PU.1 pioneer transcription factor activity. Nature Genetics, 56(10), 2213–2227. https://doi.org/10.1038/s41588-024-01911-7

      Wheat, J. C., Salsman, J., Reekie, I., Mathhwala, A., Black, K. L., Tiedt, R., Shroff, H., & Steidl, U. (2020). Single-molecule imaging of transcription dynamics in somatic stem cells. Nature, 583(7816), 431–436. https://doi.org/10.1038/s41586-020-2432-4

    1. eLife Assessment

      This manuscript presents a valuable and timely contribution by incorporating desolvation barriers into coarse-grained models of biomolecular condensates. The findings are convincing, supported by a clear physical model and systematic simulations showing effects on phase behavior, packing, and dynamics. Some clarification and broader context would improve the manuscript, but it provides a foundation that will be of use for developing more realistic coarse-grained interaction schemes.

    2. Reviewer #1 (Public review):

      This manuscript is very interesting and timely. By introducing the critical effects of desolvation barriers and solvent (water)-separated minima into the implicit-solvent potentials (of mean force, PMFs) for coarse-grained molecular dynamics simulations of biomolecular liquid-liquid phase separation (LLPS), this work fills a gap that should be apparent to researchers of protein folding in the past couple of decades but has so far escaped deserved attention such that these basic features of aqueous solvation have seldom, though not never, been invoked in recent studies of biomolecular condensates. Although the present paper deals almost exclusively with homopolymers, this work can be a foundation for the future development of a new, more physical coarse-grained interaction scheme for simulating amino acid sequence-dependent effects, which I presume is the authors' ongoing or next endeavor. The results presented in this manuscript are highly valuable.

      However, there is room for improvement in the authors' description of (i) the broader impact of effects of desolvation barrier and solvent-separated minimum in the thermodynamics of biomolecular condensates, especially with regard to the ramifications on hydrostatic pressure-dependent effects; (ii) the physical implication of using a 20-parameter hydropathy scale rather than a 210-parameter pairwise amino acid interaction scheme; and (iii) temperature-dependent effects, including the authors' discussion of "enthalpic" and "entropic" contributions. In all these aspects, the authors' discussion should be put in a more comprehensive context of the existing literature. At a few other places, the description of the methods and results should be clarified as well. Accordingly, the authors should revise the manuscript to address the following items thoroughly within the revised manuscript (not merely in the response letter) with the additional references mentioned below included in the revised discussion:

      (1) In several places, e.g., on line 77 (p.2), the authors appear to suggest that "implicit-solvent representation" is the origin of the deficiency in commonly utilized coarse-grained potentials that this study is aiming to rectify. But desolvation barriers and solvent-separated minima are also features of implicit-solvent representations; they are just features that should be incorporated in more accurate implicit-solvent potentials. This point is stated quite clearly and accurately in the Abstract (p.1) but not consistently in the rest of the text. The authors should check the entire text carefully to ensure that a coherent, accurate perspective is presented.

      (2) In the discussion of the importance of desolvation barriers and solvent-separated minima in the Introduction (pp.1-3), connections should be drawn to recent works that utilize these PMF features to rationalize hydrostatic pressure (P)-modulated effects on biomolecular LLPS, including the P-dependent reentrant phase separation of alpha elastin; see Cinar et al. (2019) Chem Eur J 25:13049 (https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/chem.201902210) and references therein, especially discussions around Figures 10, 11 & 13 in this reference.

      (3) In the lower panels of Figures 2D, E (p.5), what do the differently colored small circles in the double-minimum free energy profiles represent? Does the color shading have the same meaning as that in the upper panels? If so, what do the positions of the circles on the free energy profile represent? The authors should clarify this.

      (4) The discussion regarding entropy and enthalpy around Figure 2 is quite confusing as it stands. What do the authors mean exactly by the association of entropy or enthalpy with the desolvation barrier of the solvent-separated minimum? Are they referring to conformational entropy?

      (5) Do the authors assume that the PMF (effective implicit-solvent potential) is a purely enthalpic term? It appears to be the authors' assumption. If so, the assumption has to be stated clearly in their discussion of "entropy" vs "enthalpy" around Figure 2.

      (6) Closely related to points 3-5 above, it should be stated clearly that the "temperature" used in the authors' simulations does not represent experimental temperature if the authors are using purely enthalpic effective potentials because PMFs are in fact temperature-dependent. This clarification is necessary to avoid misunderstanding. In this regard, it should be noted that temperature-dependent effective interactions have been used for modeling biomolecular condensates in analytical theory (Lin, Song, Forman-Kay & Chan, J Mol Liq 2017, already in the citation list) as well as in coarse-grained molecular dynamics simulations [Dignon et al. (2019) ACS Cent Sci 5:821-830 (https://pubs.acs.org/doi/10.1021/acscentsci.9b00102); Chakravarti & Joseph (2025) Protein Sci 34:e70284 (https://onlinelibrary.wiley.com/doi/10.1002/pro.70284)]. The latter two studies, not cited currently, are particularly relevant and thus should be cited because the authors may wish to incorporate temperature-dependent features in their ongoing or future effort in constructing a more comprehensive coarse-grained interaction scheme for biomolecular LLPS simulation.

      (7) In tackling "entropy" vs "enthalpy", it should be noted that the temperature dependence of the effective interactions entails an entropic contribution (which is itself temperature dependent) in addition to conformational entropy. As for the effective potential with desolvation barrier and solvent-separated minimum, it should be noted that the decomposition into entropic and enthalpic contributions at the direct contact, desolvation barrier, and solvent-separated minimum can be dramatically different; see, e.g., MacCallum et al. (2007) PNAS 104:6206-6210 (https://www.pnas.org/doi/full/10.1073/pnas.0605859104) and references therein.

      (8) P.7, line 340: The proportionality relation follows directly from the standard Flory-Huggins result T_c = T chi(T)/chi_c, thus the proportionality constant is exactly 1/chi_c. Is this the standard relation that the authors are invoking here? The authors should clarify this.

      (9) The study on dynamic consequences on pp.8-11 is interesting, but clarifications are necessary:

      (i) The vertical schematic in Figure 4A should be explained in detail in its entirety. As it stands, no explanation is provided either in the figure caption or in the text. In particular, what does "elasticity driven" refer to?

      (ii) The top snapshot in Figure 4A is labeled t_sim = 0 ns. Does it mean that the snapshot shown is the only chain configuration that the authors used to start the simulation, and that the snapshot does NOT represent the result of any time evolution, no matter how short the duration is? However, if that is the case, why is this snapshot identified with spinodal decomposition if it is not the product of a time evolution from a more homogeneous configuration?

      (iii) Related to (ii) - do the rectangular boxes shown represent the entire simulation box or just part of the box containing the polymer chains? One would imagine that if the top snapshot represents spinodal decomposition, the simulation would have been started at a more uniform distribution a short time prior? Why is this not the case?

      (iv) What precisely do the small yellow beads and black-colored springs in the zoom-in image of Figure 4E represent?

      (10) In discussing dynamic effects, it is useful to draw connections to related works on the effect of chain flexibility on "aging" of condensates [Biswas & Potoyan (2024) PRX Life 2:023011 (https://journals.aps.org/prxlife/abstract/10.1103/PRXLife.2.023011)] and characterization of viscoelasticity in simulations of biomolecular condensates [Tejedor et al. (2023) J Phys Chem B 127:4441-4459 (https://pubs.acs.org/doi/10.1021/acs.jpcb.3c01292)], as the effects of desolvation can be explored further based on these prior works.

      (11) Much of the present study is based on the original HPS formulation of Dignon et al. (2018). In this regard and also in anticipation of future development of improved interaction schemes, several issues should be stated and discussed, even if briefly:

      (i) The original HPS model has a basic shortcoming in accounting for the relative interaction strengths of, among others, arginine vs lysine residues [Das et al. (2020) PNAS 117:28795-28805 (https://www.pnas.org/doi/10.1073/pnas.2008122117)].

      (ii) Compared to 210-parameter pairwise interaction schemes, such as KH in Dignon et al. (2018) and Joseph et al. (2021), the 20-parameter interaction scheme is likely too restrictive to account for pairwise amino acid residue interactions [Wessén et al. (2022) J Phys Chem B 126:9222-9245 (https://pubs.acs.org/doi/10.1021/acs.jpcb.2c06181)].

      (iii) The height of the desolvation barrier may vary significantly for different amino acid residue pairs, see, e.g., Figure 11 of Cinar et al. (2019) mentioned above (and references therein). The authors should discuss these nuances in the revised version. They may also wish to take them into consideration in future investigations.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript addresses an important and timely question in the molecular simulation of biomolecular condensates. Most residue-level coarse-grained models used for IDP phase separation employ implicit solvent and represent effective interactions through relatively simple pairwise potentials. While these models have been very useful, they usually do not explicitly distinguish direct contacts from solvent-separated interactions, nor do they include an energetic barrier associated with water removal. This manuscript attempts to address that limitation by introducing desolvation-inspired terms into coarse-grained models and examining their consequences for phase behavior, chain conformations, dense-phase packing, and dynamics.

      Strengths:

      The central idea is physically well motivated. Using a simple homopolymer model, the authors show that increasing the desolvation barrier suppresses phase separation, whereas stabilizing solvent-separated contacts enhances phase separation. They further show that solvent-separated interactions can reduce dense-phase over-compaction, which is a meaningful result given the known challenges in obtaining both accurate single-chain dimensions and realistic dense-phase properties from the same coarse-grained model. The finding that desolvation-like terms can reshape dense-phase packing without simply rescaling the overall interaction strength is interesting and could be useful for future model development. I also found the attempt to connect conformational changes across dilute and dense phases with thermal distance from the critical point to be intriguing. The dynamic analysis, including the FRAP-like simulations and the discussion of kinetic arrest during coarsening, adds another useful dimension to the work.

      Weaknesses:

      At the same time, there are several places where the manuscript would benefit from more careful framing. First, the desolvation terms are still effective coarse-grained parameters rather than a direct representation of water molecules. The language sometimes gives the impression that desolvation is being treated explicitly, whereas the model introduces desolvation-inspired effective interactions into an implicit-solvent framework. Second, the conformational analysis is interesting, but the broader context of prior work on dilute-to-dense phase conformational reorganization of IDPs could be more clearly discussed. This would help clarify what is new in the present work, whether it is the conformational change itself, its dependence on desolvation terms, or the proposed scaling with distance from the critical point. Third, the dynamic results are potentially useful, but the manuscript should more clearly articulate what is nontrivial beyond the expected slowing of local rearrangements by an added barrier in the potential.

      Overall, I think this is a useful and potentially important contribution.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      This manuscript is very interesting and timely. By introducing the critical effects of desolvation barriers and solvent (water)-separated minima into the implicit-solvent potentials (of mean force, PMFs) for coarse-grained molecular dynamics simulations of biomolecular liquid-liquid phase separation (LLPS), this work fills a gap that should be apparent to researchers of protein folding in the past couple of decades but has so far escaped deserved attention such that these basic features of aqueous solvation have seldom, though not never, been invoked in recent studies of biomolecular condensates. Although the present paper deals almost exclusively with homopolymers, this work can be a foundation for the future development of a new, more physical coarse-grained interaction scheme for simulating amino acid sequence-dependent effects, which I presume is the authors' ongoing or next endeavor. The results presented in this manuscript are highly valuable.

      We thank the reviewer for all the insightful comments.

      However, there is room for improvement in the authors' description of (i) the broader impact of effects of desolvation barrier and solvent-separated minimum in the thermodynamics of biomolecular condensates, especially with regard to the ramifications on hydrostatic pressure-dependent effects; (ii) the physical implication of using a 20-parameter hydropathy scale rather than a 210-parameter pairwise amino acid interaction scheme; and (iii) temperature-dependent effects, including the authors' discussion of "enthalpic" and "entropic" contributions. In all these aspects, the authors' discussion should be put in a more comprehensive context of the existing literature. At a few other places, the description of the methods and results should be clarified as well. Accordingly, the authors should revise the manuscript to address the following items thoroughly within the revised manuscript (not merely in the response letter) with the additional references mentioned below included in the revised discussion:

      (1) In several places, e.g., on line 77 (p.2), the authors appear to suggest that "implicit-solvent representation" is the origin of the deficiency in commonly utilized coarse-grained potentials that this study is aiming to rectify. But desolvation barriers and solvent-separated minima are also features of implicit-solvent representations; they are just features that should be incorporated in more accurate implicit-solvent potentials. This point is stated quite clearly and accurately in the Abstract (p.1) but not consistently in the rest of the text. The authors should check the entire text carefully to ensure that a coherent, accurate perspective is presented.

      We thank the reviewer for the insightful comment and suggestion. Our intention in this work is not to depart from the implicit-solvent modeling framework but to incorporate desolvation effects within it. In the revised manuscript, we will edit the text so that this point is presented clearly and consistently throughout the paper.

      (2) In the discussion of the importance of desolvation barriers and solvent-separated minima in the Introduction (pp.1-3), connections should be drawn to recent works that utilize these PMF features to rationalize hydrostatic pressure (P)-modulated effects on biomolecular LLPS, including the P-dependent reentrant phase separation of alpha elastin; see Cinar et al. (2019) Chem Eur J 25:13049 (https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/chem.201902210) and references therein, especially discussions around Figures 10, 11 & 13 in this reference.

      We thank the reviewer for bringing these references to our attention. The hydrostatic pressure modulated effects on LLPS provide important context for understanding the physical significance of desolvation barriers and solvent‑separated minima. In the revised manuscript, we will expand the literature discussion by incorporating previous studies on pressure‑modulated phase separation.

      (3) In the lower panels of Figures 2D, E (p.5), what do the differently colored small circles in the double-minimum free energy profiles represent? Does the color shading have the same meaning as that in the upper panels? If so, what do the positions of the circles on the free energy profile represent? The authors should clarify this.

      We thank the reviewer for the suggestion to improve the clarity of the figure. In the lower panels of Figures 2D and 2E, the colored dots were intended solely as a qualitative illustration of the populations of residue‑pair configurations along the effective energy surface. Their colors are not related to the color scale used in the phase diagrams shown in the upper panels. We will modify the color scheme to improve clarity.

      (4) The discussion regarding entropy and enthalpy around Figure 2 is quite confusing as it stands. What do the authors mean exactly by the association of entropy or enthalpy with the desolvation barrier of the solvent-separated minimum? Are they referring to conformational entropy?

      We apologize for the confusion. When the desolvation barrier is high, configurations with inter‑residue distances corresponding to the barrier region become difficult to access, thereby reducing the conformational entropy of the condensed phase. This interpretation is supported by Figure 2—figure supplement 1C, where increasing the desolvation barrier decreases the population in the barrier region of the radial distribution function, indicating that fewer residue‑pair configurations are sampled there. In contrast, increasing the depth of the solvent‑separated minimum makes the condensed phase more energetically favorable. In the revised manuscript, we will incorporate this discussion to improve clarity.
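
      The population argument above can be illustrated with a toy calculation. The sketch below is not the authors' potential: the Gaussian functional form, the parameter values, and the names `pair_potential` and `barrier_fraction` are all assumptions chosen for illustration. It builds an effective pair potential with a contact minimum, a desolvation barrier, and a solvent-separated minimum, then reports how much of the Boltzmann weight falls in the barrier region as the barrier height is raised.

```python
import math

# All parameters below are illustrative (reduced units), not the authors' values.
R_CM, R_DB, R_SSM = 1.0, 1.3, 1.6   # contact minimum, barrier, solvent-separated minimum
WIDTH = 0.1                          # Gaussian width shared by the three features


def gaussian(r, center):
    """Unit-height Gaussian bump centred at `center`."""
    return math.exp(-((r - center) ** 2) / (2.0 * WIDTH ** 2))


def pair_potential(r, eps_db, eps_cm=1.0, eps_ssm=0.3):
    """Toy potential: two attractive wells separated by a barrier of height eps_db."""
    return (-eps_cm * gaussian(r, R_CM)
            + eps_db * gaussian(r, R_DB)
            - eps_ssm * gaussian(r, R_SSM))


def barrier_fraction(eps_db, kT=1.0, r_min=0.8, r_max=2.0, n=1200):
    """Fraction of the weight r^2 exp(-U/kT) lying within one WIDTH of the barrier."""
    dr = (r_max - r_min) / n
    total = in_barrier = 0.0
    for i in range(n):
        r = r_min + (i + 0.5) * dr          # midpoint rule on a uniform grid
        w = r * r * math.exp(-pair_potential(r, eps_db) / kT) * dr
        total += w
        if abs(r - R_DB) < WIDTH:
            in_barrier += w
    return in_barrier / total


for eps_db in (0.0, 0.5, 2.0):
    print(f"eps_db = {eps_db:.1f}  barrier-region fraction = {barrier_fraction(eps_db):.3f}")
```

      In this toy model the barrier-region fraction falls monotonically as `eps_db` grows, which is the qualitative trend the response attributes to the radial distribution function. It says nothing about the magnitude of the effect in the actual simulations.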

      (5) Do the authors assume that the PMF (effective implicit-solvent potential) is a purely enthalpic term? It appears to be the authors' assumption. If so, the assumption has to be stated clearly in their discussion of "entropy" vs "enthalpy" around Figure 2.

      We thank the reviewer for highlighting this important point. In this work, the PMF profile is constructed from atomistic simulation results, and thus both entropic and enthalpic contributions shape the overall PMF. In the revised manuscript, we will clarify that the PMF represents a free‑energy profile along the intermolecular distance and therefore incorporates enthalpic and entropic contributions from the solute, solvent, and configurational degrees of freedom.

      (6) Closely related to points 3-5 above, it should be stated clearly that the "temperature" used in the authors' simulations does not represent experimental temperature if the authors are using purely enthalpic effective potentials because PMFs are in fact temperature-dependent. This clarification is necessary to avoid misunderstanding. In this regard, it should be noted that temperature-dependent effective interactions have been used for modeling biomolecular condensates in analytical theory (Lin, Song, Forman-Kay & Chan, J Mol Liq 2017, already in the citation list) as well as in coarse-grained molecular dynamics simulations [Dignon et al. (2019) ACS Cent Sci 5:821-830 (https://pubs.acs.org/doi/10.1021/acscentsci.9b00102); Chakravarti & Joseph (2025) Protein Sci 34:e70284 (https://onlinelibrary.wiley.com/doi/10.1002/pro.70284)]. The latter two studies, not cited currently, are particularly relevant and thus should be cited because the authors may wish to incorporate temperature-dependent features in their ongoing or future effort in constructing a more comprehensive coarse-grained interaction scheme for biomolecular LLPS simulation.

      We thank the reviewer for raising this important point. We agree that PMFs and the corresponding effective interactions should be temperature dependent, and therefore the simulation temperature in our current temperature-independent CG potential cannot be interpreted as a fully quantitative experimental temperature. In the revised manuscript, we will clarify the above point. We will also expand the discussion to include previous studies that introduced temperature-dependent effective interactions in analytical theories and coarse-grained simulations of biomolecular condensates.

      (7) In tackling "entropy" vs "enthalpy", it should be noted that the temperature dependence of the effective interactions entails an entropic contribution (which is itself temperature dependent) in addition to conformational entropy. As for the effective potential with desolvation barrier and solvent-separated minimum, it should be noted that the decomposition into entropic and enthalpic contributions at the direct contact, desolvation barrier, and solvent-separated minimum can be dramatically different; see, e.g., MacCallum et al. (2007) PNAS 104:6206-6210 (https://www.pnas.org/doi/full/10.1073/pnas.0605859104) and references therein.

      We agree that a temperature‑dependent PMF includes entropic contributions beyond the configurational entropy discussed around Figure 2. In the present manuscript, our discussion of entropy in that context refers specifically to the reduced accessible configurational space of residue‑pair states in the coarse‑grained ensemble, rather than to a full thermodynamic decomposition of the PMF. In the revised manuscript, we will make this distinction explicit. We will also note that the direct‑contact minimum, desolvation barrier, and solvent‑separated minimum may each have distinct enthalpic and entropic components, and that resolving these components would require additional temperature‑dependent PMF calculations. We will discuss this as a limitation of the current model and as a direction for future parameterization.
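
      The decomposition referred to here can be written compactly. As a sketch in assumed notation (W for the PMF at separation r, and ΔH and ΔS for its enthalpic and entropic parts at fixed pressure P; none of these symbols are taken from the manuscript), the standard thermodynamic relations are:

```latex
\begin{align}
W(r;T) &= \Delta H(r;T) - T\,\Delta S(r;T), \\
\Delta S(r;T) &= -\left(\frac{\partial W(r;T)}{\partial T}\right)_{P}, \\
\Delta H(r;T) &= W(r;T) - T\left(\frac{\partial W(r;T)}{\partial T}\right)_{P}.
\end{align}
```

      A temperature-independent W therefore has ΔS = 0 at every separation, which is why resolving the components at the contact minimum, barrier, and solvent-separated minimum needs PMFs computed at more than one temperature.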

      (8) P.7, line 340: The proportionality relation follows directly from the standard Flory-Huggins result T_c = T chi(T)/chi_c, thus the proportionality constant is exactly 1/chi_c. Is this the standard relation that the authors are invoking here? The authors should clarify this.

      We thank the reviewer for pointing this out. Yes, our argument uses the condition that chi_c is fixed at the critical point for a given chain length. We will revise the text to explicitly state this relation and add the missing intermediate step, so that the proportionality used in the manuscript is clearer.
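
      The intermediate step can be sketched as follows, under the assumption (not stated in this exchange) that the interaction parameter is purely enthalpic, chi(T) = A/T with A independent of temperature. Setting chi(T_c) = chi_c gives T_c = A/chi_c = T chi(T)/chi_c. The short example below uses the textbook Flory-Huggins value chi_c = (1/2)(1 + 1/sqrt(N))^2 for a chain of N segments in a monomeric solvent; the function names are hypothetical.

```python
import math


def chi_critical(n_segments):
    """Flory-Huggins critical chi for a chain of n_segments in a monomeric solvent."""
    return 0.5 * (1.0 + 1.0 / math.sqrt(n_segments)) ** 2


def critical_temperature(temperature, chi_at_temperature, n_segments):
    """T_c = T * chi(T) / chi_c, assuming chi(T) = A / T with A temperature independent."""
    return temperature * chi_at_temperature / chi_critical(n_segments)


# For a fixed chain length, T_c scales linearly with chi(T) evaluated at the same T.
n = 100
tc_weak = critical_temperature(300.0, 0.8, n)
tc_strong = critical_temperature(300.0, 1.6, n)
print(f"chi_c = {chi_critical(n):.3f}, T_c ratio = {tc_strong / tc_weak:.2f}")
```

      With chain length held fixed, chi_c is a constant, so doubling chi(T) doubles T_c. That is the proportionality with constant 1/chi_c that the reviewer describes.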

      (9) The study on dynamic consequences on pp.8-11 is interesting, but clarifications are necessary:

      (i) The vertical schematic in Figure 4A should be explained in detail in its entirety. As it stands, no explanation is provided either in the figure caption or in the text. In particular, what does "elasticity driven" refer to?

      (ii) The top snapshot in Figure 4A is labeled t_sim = 0 ns. Does it mean that the snapshot shown is the only chain configuration that the authors used to start the simulation, and that the snapshot does NOT represent the result of any time evolution, no matter how short the duration is? However, if that is the case, why is this snapshot identified with spinodal decomposition if it is not the product of a time evolution from a more homogeneous configuration?

      (iii) Related to (ii) - do the rectangular boxes shown represent the entire simulation box or just part of the box containing the polymer chains? One would imagine that if the top snapshot represents spinodal decomposition, the simulation would have been started at a more uniform distribution a short time prior? Why is this not the case?

      (iv) What precisely do the small yellow beads and black-colored springs in the zoom-in image of Figure 4E represent?

      We thank the reviewer for pointing out these unclear aspects of Figure 4. In the revised manuscript, we will better explain the vertical schematic in Figure 4A, including the progression from the early growth of density fluctuations, to intermediate kinetic arrest, and finally to late-stage coarsening. We will also clarify that “elasticity driven” refers to the resistance to domain deformation caused by transient inter-chain network connectivity. We will clarify that t_sim = 0 denotes the time immediately after the temperature quench from the high-temperature homogeneous state to the low-temperature two-phase region. This snapshot is the post-quench initial configuration, while spinodal decomposition refers to the subsequent amplification of density fluctuations after the quench. The displayed snapshot is taken from one representative trajectory and is not the only initial configuration used in the simulations; the quantitative kinetic analysis was averaged over multiple independent trajectories. The rectangular box represents the entire simulation box. Although the system was equilibrated at high temperature before the quench, instantaneous density fluctuations remain, so the initial configuration is not perfectly uniform. In Figure 4E, the yellow beads represent interacting residue pairs. The black springs schematically represent the transient elastic network formed by these interactions, rather than a precise structural model.

      (10) In discussing dynamic effects, it is useful to draw connections to related works on the effect of chain flexibility on "aging" of condensates [Biswas & Potoyan (2024) PRX Life 2:023011 (https://journals.aps.org/prxlife/abstract/10.1103/PRXLife.2.023011)] and characterization of viscoelasticity in simulations of biomolecular condensates [Tejedor et al. (2023) J Phys Chem B 127:4441-4459 (https://pubs.acs.org/doi/10.1021/acs.jpcb.3c01292)], as the effects of desolvation can be explored further based on these prior works.

      We thank the reviewer for this helpful suggestion. In the revised Discussion, we will cite and discuss the related studies on condensate aging and viscoelasticity, including the effects of chain flexibility, sticker lifetime, desolvation, and transient network formation on condensate material properties. These works provide an important context for interpreting our dynamic results. We will clarify that desolvation may influence condensate dynamics not only by slowing local rearrangements, but also by modulating transient network connectivity, kinetic arrest, and viscoelastic relaxation.

      (11) Much of the present study is based on the original HPS formulation of Dignon et al. (2018). In this regard and also in anticipation of future development of improved interaction schemes, several issues should be stated and discussed, even if briefly:

      (i) The original HPS model has a basic shortcoming in accounting for the relative interaction strengths of, among others, arginine vs lysine residues [Das et al. (2020) PNAS 117:28795-28805 (https://www.pnas.org/doi/10.1073/pnas.2008122117)].

      (ii) Compared to 210-parameter pairwise interaction schemes, such as KH in Dignon et al. (2018) and Joseph et al. (2021), the 20-parameter interaction scheme is likely too restrictive to account for pairwise amino acid residue interactions [Wessén et al. (2022) J Phys Chem B 126:9222-9245 (https://pubs.acs.org/doi/10.1021/acs.jpcb.2c06181)].

      (iii) The height of the desolvation barrier may vary significantly for different amino acid residue pairs, see, e.g., Figure 11 of Cinar et al. (2019) mentioned above (and references therein). The authors should discuss these nuances in the revised version. They may also wish to take them into consideration in future investigations.

      We thank the reviewer for highlighting these important limitations of the original HPS-based framework. We agree that a 20-parameter hydropathy-scale model has limitations in fully capturing residue-pair-specific interactions, including well-established differences such as those between arginine and lysine. In the revised manuscript, we will explicitly discuss this limitation and cite the suggested studies on residue-specific and pairwise interaction schemes. We also agree that desolvation barriers and solvent-separated minima are likely to depend on amino-acid pair identity. In the present work, we employ a simplified, residue-independent desolvation parameterization to isolate the general thermodynamic and kinetic consequences of desolvation in coarse-grained LLPS simulations. In the revised Discussion, we will clarify this scope and emphasize that developing residue-pair-specific desolvation parameters, potentially within a 210-parameter interaction framework, is an important direction for future work.

      Reviewer #2 (Public review):

      Summary:

      This manuscript addresses an important and timely question in the molecular simulation of biomolecular condensates. Most residue-level coarse-grained models used for IDP phase separation employ implicit solvent and represent effective interactions through relatively simple pairwise potentials. While these models have been very useful, they usually do not explicitly distinguish direct contacts from solvent-separated interactions, nor do they include an energetic barrier associated with water removal. This manuscript attempts to address that limitation by introducing desolvation-inspired terms into coarse-grained models and examining their consequences for phase behavior, chain conformations, dense-phase packing, and dynamics.

      Strengths:

      The central idea is physically well motivated. Using a simple homopolymer model, the authors show that increasing the desolvation barrier suppresses phase separation, whereas stabilizing solvent-separated contacts enhances phase separation. They further show that solvent-separated interactions can reduce dense-phase over-compaction, which is a meaningful result given the known challenges in obtaining both accurate single-chain dimensions and realistic dense-phase properties from the same coarse-grained model. The finding that desolvation-like terms can reshape dense-phase packing without simply rescaling the overall interaction strength is interesting and could be useful for future model development. I also found the attempt to connect conformational changes across dilute and dense phases with thermal distance from the critical point to be intriguing. The dynamic analysis, including the FRAP-like simulations and the discussion of kinetic arrest during coarsening, adds another useful dimension to the work.

      We thank the reviewer for this positive and constructive assessment. We are encouraged that the reviewer found the central idea physically well motivated and recognized the value of introducing desolvation-inspired terms to distinguish direct contacts, solvent-separated interactions, and the energetic barrier associated with water removal in coarse-grained models of biomolecular condensates.

      Weaknesses:

      At the same time, there are several places where the manuscript would benefit from more careful framing. First, the desolvation terms are still effective coarse-grained parameters rather than a direct representation of water molecules. The language sometimes gives the impression that desolvation is being treated explicitly, whereas the model introduces desolvation-inspired effective interactions into an implicit-solvent framework.

      We agree that the current wording should more clearly reflect the nature of our model. The desolvation terms introduced in this work are effective coarse-grained interaction terms rather than an explicit molecular representation of water. In the revised manuscript, we will carefully revise the language throughout the article to describe the model as incorporating desolvation-inspired effective interactions within an implicit-solvent coarse-grained framework.

      Second, the conformational analysis is interesting, but the broader context of prior work on dilute-to-dense phase conformational reorganization of IDPs could be more clearly discussed. This would help clarify what is new in the present work, whether it is the conformational change itself, its dependence on desolvation terms, or the proposed scaling with distance from the critical point.

      We appreciate this suggestion. In the revised manuscript, we will place the conformational analysis in the context of prior work and discuss the observed conformational changes more explicitly from the perspective of desolvation-inspired interactions. We will also clarify the assumptions behind the scaling relation between conformational change and thermal distance from the critical point.

      Third, the dynamic results are potentially useful, but the manuscript should more clearly articulate what is nontrivial beyond the expected slowing of local rearrangements by an added barrier in the potential.

      We thank the reviewer for the suggestion. In the revised manuscript, we will clarify which aspects of the observed dynamics can be directly expected from the added desolvation barrier and which trends arise from the combined effects of desolvation on packing density, chain mobility, kinetic arrest, and coarsening.

      We again thank the editors and reviewers for their constructive comments and suggestions. We believe that the planned revisions will improve the precision of the model description, clarify the physical interpretation of the desolvation-inspired terms, expand the relevant literature context, and better define the scope and limitations of the current framework.

    1. eLife Assessment

      This study presents a valuable framework that uses anticipatory eye movements to track how expectations are formed and revised during implicit probabilistic sequence learning. The evidence supporting a behavioural dissociation between errors arising from environmental noise and errors reflecting an inaccurate internal model is solid, but the oculomotor data describe behaviour rather than explain the underlying computational mechanisms, and the stronger mechanistic claims - that learning is more repetition-based than error-driven - remain incomplete without formal comparison against computational models of error-driven learning. The emerging reaction-time difference between conditions appears driven by slowing to low-probability stimuli rather than facilitation of high-probability ones, an asymmetry that requires decomposition and consideration of alternative explanations. The potential contamination of the anticipatory measure by starting gaze position should be addressed through control analyses, and the "process-pure" framing should be tempered, given that oculomotor behaviour is itself subject to motor learning.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript presents an original quantitative approach for tracking the online formation and updating of prior beliefs. In an Alternating Serial Reaction Time task, participants were exposed to probabilistic visual streams, and their pre-stimulus saccadic behavior (i.e., the first eye movement after the previous stimulus disappeared) was monitored via eye-tracking. Since the stimuli followed an alternating probabilistic sequence, upcoming events did not appear with full certainty: some stimuli had a higher, some a lower probability. By comparing anticipatory oculomotor behavior between high and low probability events, the authors dissociated between learning/belief updating and general oculomotor noise. Noise-driven errors were more frequent than learning-dependent errors, which nonetheless triggered more belief updating (i.e., a change in oculomotor behavior in a subsequent encounter of the same event). Interestingly, updating depended more strongly on whether a prior belief was consistent with the task's probabilistic structure than on prediction errors. These findings suggest that incidental, implicit statistical learning may rely on conservative updating with a relatively low learning rate, or on errorless algorithms, rather than prediction errors per se.

      Strengths:

      By applying a fine-grained analysis of anticipatory oculomotor behavior, this work establishes new continuous metrics to quantify the gradual learning and refinement of prior expectations during statistical learning. These metrics provide convincing evidence of the dynamics of anticipatory oculomotor behavior.

      The method is paradigm-independent, offering generalizable metrics for tracking the dynamic formation and refinement of predictive models in any task involving probabilistic stimulus streams. In the future, computational modeling may leverage these continuous metrics to better dissect the mechanisms underlying statistical learning.

      Weaknesses:

      The authors subscribe to the idea that statistical learning is not a unified concept but rather is implemented via multiple underlying mechanisms. However, it remains unspecified what these different mechanisms could be, and how eye movements could contribute to distinguishing between them.

      The authors claim that they developed a novel methodological approach to probe whether anticipatory eye movements directly reflect priors, thereby filling an outstanding gap. However, this claim ignores mounting relevant work on structure learning using eye-tracking in the developmental field.

      The authors claim that their framework quantifies trial-by-trial oculomotor dynamics, while in fact the analyses use epochs (i.e., groups of multiple trials) as predictors. Why not use trial number as a predictor to truly investigate trial-by-trial dynamics that directly reflect anticipation, surprisal, and revision?

    3. Reviewer #2 (Public review):

      Summary:

      Hann and colleagues introduce a gaze-based analytical framework designed to capture, on a trial-by-trial basis, how people form and revise their predictions during implicit probabilistic sequence learning. Using an eye-tracking adaptation of an alternating sequence task, they record the first anticipatory saccade during the response-stimulus interval and classify each such saccade along two dimensions: whether it was directed toward a high- or low-probability upcoming stimulus (the learning-dependent vs. not-learning-dependent distinction), and whether the anticipated location coincided with the stimulus that actually appeared. A complementary iterative-updating metric codes whether a participant's prediction for a given three-element context is repeated or revised on successive encounters of that context.

      On the basis of these measures, the authors report that errors congruent with the inferred regularity - which they interpret as reflecting environmental noise - become progressively more frequent than errors reflecting an inaccurate internal model; that participants show a pronounced tendency to repeat their previous prediction rather than revise it; and that updates depend more on whether a prior belief is congruent with the task's statistical structure than on whether the previous prediction was confirmed. They interpret these results as evidence that statistical learning is less error-driven and more repetition-based (Hebbian in character) than is typically assumed.

      Strengths:

      The methodological ambition of the work is considerable, and the paper makes several contributions that are likely to be useful to the implicit-learning and predictive-processing communities. Using the first anticipatory saccade as a pre-response behavioral readout of prediction is conceptually well-motivated: it provides a trial-by-trial index of predictive orienting at a temporal resolution that manual reaction times cannot deliver, and it does so before the outcome of the trial is known. The explicit distinction between errors arising because the task's outcome is stochastic - that is, predictions congruent with the statistical structure but unconfirmed by the stochastic sample - and errors arising because the internal model is inaccurate is a theoretically meaningful move: predictive-coding and Bayesian accounts have long argued that these two sources of surprise should carry different weight for model revision, and the authors offer a behavioral operationalization of that distinction. The analytical pipeline is not tied to the specific paradigm used here and could be applied to other probabilistic sequence-learning tasks, which gives it broader methodological utility than a single-paradigm report. Finally, the demonstration that learners maintain their prior across successive occurrences of the same context, even when it has been disconfirmed by the most recent outcome, is a robust behavioral observation that speaks directly to an unresolved debate about whether statistical learning is dominantly error-driven.

      Weaknesses:

      The framework and the core behavioral observations are valuable, but several inferential steps - from the gaze signal to the cognitive constructs the authors invoke - are not fully supported by the present design, and these gaps affect how readers should interpret the stronger theoretical conclusions.

      The "process-pure" framing conflates sensitivity with construct purity. The authors repeatedly describe the eye-tracking measure as providing a more process-pure index of statistical learning than manual-response paradigms. Anticipatory saccades are themselves a learned motor behavior - the oculomotor system is among the most plastic motor outputs the primate brain generates, and sequence learning in the saccadic system is well-documented. The present design does not dissociate learning of the statistical structure from learning of the oculomotor sequence that expresses it, so the measure is not, on its face, free from the motor-learning confound that the authors criticize in button-press paradigms. The framing should be read as aspirational rather than as demonstrated by the present data.

      The oculomotor reaction-time data do not show the canonical signature of statistical learning. Reaction times for low-probability trials rise across epochs while those for high-probability trials remain approximately flat (Figure 5). The emerging difference between the two trial types, therefore, appears to be driven by a slowing of responses to low-probability stimuli rather than by a facilitation of responses to high-probability ones, and the authors do not rule out the alternative interpretations that this pattern reflects fatigue, a motor floor effect, or inhibition of unexpected locations. Because no fixation constraint is imposed during the response-stimulus interval, pre-stimulus gaze drift toward the anticipated location will artifactually reduce reaction time on precisely those trials the authors wish to treat as learning-driven; the fact that measured reaction times remain well above zero even on trials classified as correct anticipations is itself evidence that this contamination is present. The oculomotor reaction-time data, therefore, do not provide as clean a verification of learning as the manuscript implies.

      The correct/error labeling of anticipatory saccades incorporates information that the participant did not have. Because the first saccade occurs during the response-stimulus interval - that is, before the upcoming stimulus is revealed - the participant's internal predictive state is identical whether the trial is subsequently classified as a learning-dependent correct response or a learning-dependent error. Any difference in the epochwise frequency of these two categories must therefore be driven, at least in part, by the external stochastic structure of the task rather than by a difference in the predictive process itself. In particular, the observation that learning-dependent errors are the most frequent saccade type (Figure 7) is predicted by the prior probabilities of the outcomes alone, given a high-probability prediction, without appeal to any difference in predictive state. Readers should recognize that the theoretically meaningful contrast is between learning-dependent and not-learning-dependent anticipations (two categories), and that the four-way split risks confounding predictive state with outcome stochasticity.
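
      The argument above can be illustrated with a minimal sketch. It assumes four stimulus locations, equally likely low-probability outcomes, and an illustrative high-probability outcome rate of 0.625 (the sum of the 50% and 12.5% figures quoted for the two high-probability triplet types). The function name and parameters are hypothetical and are not taken from the manuscript.

```python
def four_way_split(q_high, p_high, n_locations=4):
    """Expected proportions of the four saccade categories.

    q_high: probability that the anticipatory saccade targets the
            high-probability location (the participant's predictive state).
    p_high: probability that the high-probability stimulus actually appears
            (a property of the task, not of the participant).
    n_locations: number of stimulus locations (assumed to be four here).
    """
    n_low = n_locations - 1
    # Chance that one particular low-probability location is the outcome,
    # assuming the low-probability outcomes are equally likely.
    p_each_low = (1.0 - p_high) / n_low
    return {
        "learning_dependent_correct": q_high * p_high,
        "learning_dependent_error": q_high * (1.0 - p_high),
        "not_learning_dependent_correct": (1.0 - q_high) * p_each_low,
        "not_learning_dependent_error": (1.0 - q_high) * (1.0 - p_each_low),
    }


# Two hypothetical participants with different predictive states.
weak_learner = four_way_split(q_high=0.5, p_high=0.625)
strong_learner = four_way_split(q_high=0.9, p_high=0.625)
```

      Under these assumptions, the ratio of learning-dependent correct responses to learning-dependent errors equals p_high / (1 - p_high) for both hypothetical participants. The split within the learning-dependent category is therefore set by the task's outcome probabilities, whereas only the learning-dependent versus not-learning-dependent contrast varies with q_high.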

      The iterative-updating metric does not distinguish prior revision from alternative processes. The binary update / no-update code, computed across non-contiguous occurrences of the same three-element context, does not discriminate between a genuine update of the internal model, simple episodic retrieval of a previously encountered triplet, and oculomotor perseveration. Without a formal generative model to anchor the interpretation, the central theoretical claim - that statistical learning is less error-driven than commonly assumed - is underdetermined by the data. The repetition pattern the authors observe is equally consistent with an error-driven model equipped with a low learning rate in a stable environment, an interpretation the authors themselves acknowledge in the Discussion. Adjudicating between these possibilities requires comparison against explicit computational models, which the present manuscript does not provide.
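
      To make the limitation concrete, a binary update code of the kind described might be computed roughly as follows. This is a sketch only: the data layout (a list of context and predicted-location pairs) and the function name are assumptions, and the authors' actual coding scheme may differ.

```python
def code_updates(trials):
    """Label each anticipation as repeated (False) or revised (True).

    trials: iterable of (context, predicted_location) pairs in trial order,
            where context is any hashable description of the preceding
            elements (hypothetical layout).
    Returns one entry per trial: None on the first encounter of a context,
    otherwise True if the prediction differs from the one made on the
    previous encounter of the same context, however many trials ago.
    """
    last_prediction = {}
    updates = []
    for context, predicted in trials:
        if context in last_prediction:
            updates.append(predicted != last_prediction[context])
        else:
            updates.append(None)
        last_prediction[context] = predicted
    return updates


example = [
    (("1", "r", "2"), 3),  # first encounter of this context
    (("2", "r", "4"), 1),  # a different context
    (("1", "r", "2"), 3),  # same prediction as before: no update
    (("1", "r", "2"), 2),  # different prediction: update
]
flags = code_updates(example)
```

      The code records only whether the gaze target changed between encounters. Nothing in it separates model revision from episodic retrieval or oculomotor perseveration, which is why an explicit generative model would be needed to interpret the resulting flags.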

      Data loss and the absence of fixation control. An interpretable saccade is detected on fewer than half of all trials (48.76%; line 889), and the manuscript does not report the distribution of saccade counts per interval, the per-condition trial counts after all exclusions, or the decomposition of the 20% missing-data threshold into its underlying causes. Given that the entire inferential apparatus rests on this subset of trials, the degree of data loss is a relevant context for the reader. Separately, no fixation constraint is imposed between trials: the participant's starting gaze position at the onset of each response-stimulus interval is whatever position was reached at the end of the preceding response, and this starting position carries trial-history information correlated with the upcoming stimulus. This leaves open the possibility that what is classified as predictive orienting partly reflects the mechanical consequences of where the eye happened to be at the end of the previous trial. The authors defend the absence of a fixation cross on the grounds that it would transform the transitional structure of the task, but this is an empirical claim presented without a supporting citation.

      Heterogeneity within the high-probability condition is not addressed. The two routes to a high-probability triplet in the design - pattern-random-pattern (50% of trials) and random-pattern-random (12.5%) - differ both in their base rate and in the reliability of the contextual cue they provide. Collapsing across these subtypes is an analytical choice that may conceal heterogeneity in the underlying learning process.

      Appraisal: Do the results support the authors' conclusions?

      The framework succeeds in providing a trial-by-trial behavioral readout of predictive orienting that is more fine-grained than conventional reaction-time measures, and the behavioral dissociation between errors congruent with the regularity and errors reflecting an inaccurate internal model is a genuine empirical contribution. The conclusions about the mechanistic nature of statistical learning should be read as motivating hypotheses for future modeling work rather than as settled empirical claims.

      Impact and utility:

      The analytical framework introduced here is likely to be useful to researchers working on implicit learning, predictive processing, and Bayesian models of perception and cognition. The measure of predictive orienting and the iterative-updating code could be adapted to a range of probabilistic learning paradigms, and the behavioral dissociation between noise-driven and model-mismatch errors fills a methodological gap that the field has long acknowledged. The authors share their data and code openly, which will facilitate reuse. The most durable contribution of the paper is methodological; the theoretical claims about the nature of statistical learning will require additional computational modeling before they can be regarded as established.

    4. Author response:

      We thank the Reviewers for their time and effort in reviewing our manuscript. We are particularly thankful for the literature recommendations of Reviewer 1 and the analysis ideas of Reviewer 2.

      We are glad that both Reviewers agree that the method we developed provides value to the field. We furthermore agree that our theoretical claims and conclusions could be supported by further analyses. Thus, we primarily plan to focus on this.

      We plan to strengthen our statements by:

      - Comparing our metrics to those of alternative learning processes and hypotheses

      - Running additional analyses, including ones using standardized learning scores, collapsed saccade likelihoods for learning-dependent and not-learning-dependent saccades, angular deviations instead of the binary update variable, and a breakdown of high-probability triplets into ones that end with a pattern element or a random one.

      - Adding further information regarding saccades, trials without saccades, and saccade starting points.

      Furthermore, we plan to strengthen our Methods section: some of the Reviewers’ points potentially stem from our unclear description of the ASRT task; thus, the Task & Procedure section needs deeper and clearer explanations. Lastly, we will extend the Introduction, citing the literature recommended in the reviews, which indeed could provide further depth.

    1. eLife Assessment

      This valuable study uses large-scale 7T naturalistic fMRI data and nonlinear pRF modeling to map the tonotopic organization of the human auditory cortex, linking spectral tuning to speech selectivity and cortical hierarchy. The evidence is solid, demonstrating that movie-based stimuli can recover robust population-level auditory maps and offering tools for leveraging existing datasets, although there is room for improvement in relating static tonotopy to dynamic speech processing and in presentation clarity. The study will be of interest to a broad audience working on auditory cortex organization and mapping.

    2. Reviewer #1 (Public review):

      This paper reports an auditory-directed analysis of the HCP 7T short movie dataset. It has the goal of using the film audio to create tonotopic (pRF) maps and combine these with other HCP-provided data (e.g., T1/T2 ratio) to improve understanding of auditory cortex organization and relative functional segregation, particularly in reference to speech processing.

      The paper is ambitious, uses well-founded existing tools for combining data across subjects, and, in the Discussion in particular, makes a lot of careful points about interpretation. The paper shows that, at least for a very large dataset at 7T (and for at least a few individual participants), good-quality cross-subject-average tonotopic maps can be extracted from fMRI movie datasets via basic spectral modelling of the movie soundtracks. It also suggests ways that these movie-based maps can be combined to come up with potential models of cortical organization. The PCA analysis is a creative way of combining maps (see below for comments).

      These are valuable tools for the field in exploiting/exploring existing data, and I look forward to trying them out myself. I want to emphasize that this is not 'damning with faint praise' - a concrete demonstration of this approach with freely available tools/examples is not only the product of a lot of effort (thank you!), but will be an impetus to research going forward.

      In terms of the contribution to our understanding of auditory cortex organization, using this large N cohort, they replicate a number of findings in the literature from the last couple of decades, including the overlap of low frequency preference with greater speech stimulus preference (e.g. Moerel, de Martino, & Formisano, 2012, J Neuro), patterns of BF width across cortex (Moerel et al., various; Thomas et al. 2015), use of shorter and longer natural sounds (Moerel et al., 2012, 2014; Dick et al., 2012), the importance/influence of sustained spectral attention for tonotopic mapping (da Costa et al., 2013; Dick et al., 2017; Riecke et al. 2017), the use of tonotopy and 'myelin' mapping to establish areal or regional boundaries (Dick et al., 2012; de Martino et al., 2015; Besle et al., 2018, etc) and the overall shape and consistency of tonotopic maps (e.g., Talavage et al., 2004, Humphries et al., 2010 and many others). To my knowledge/memory, this is the first tonotopy paper that has used the cross-subject cortical-surface-based averaging techniques that are driven by more than curvature/sulcal alignment.

      The paper focuses in particular on creating new sets of ROIs based on the various maps derived from the data. Despite being quite familiar with this body of work, I found it difficult to follow how the ROIs were derived, and how and why they were different and/or an improvement over existing parcellation schemes (see for instance Sereno, Sood, & Huang, 2022 for a comprehensive parcellation framework across modalities including auditory, based on combined receptive surface mapping, myelin estimates, and other metrics).

      Given the hour of fast(ish) fMRI data on a 7T with pretty big voxels (so high SNR), one aspect of the results that I found surprising - and potentially informative - was the lack of reliable tonotopic 'mappability' in the majority of participants. The authors' analytic approach to computing the pRFs seems completely reasonable (and shows good average maps), and yet individual maps seem unreliable except for the very best examples. I wondered if this might be due to problems in data collection with earbuds becoming slightly uncoupled and therefore delivering a lot less lower-frequency response and also not preventing scanner noise from getting to the ear; this is often a problem with any in-scanner earbud system (including the Sensimetrics). I wondered if the robustness of the 'speech maps' was associated with that of tonotopy; if they are highly associated, that would suggest that either there were huge individual differences in auditory attention, or perhaps that there was some variability in the acoustic signal delivered to each participant.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors leverage a high-powered 7T fMRI dataset of subjects viewing naturalistic audiovisual movies to elucidate the topographic organization of the human auditory cortex. By applying a nonlinear pRF model, they successfully map tonotopic gradients extending beyond the auditory core into the STG and STS areas. A primary finding is a medial-to-lateral gradient of increasing response compressivity, which the authors claim mirrors the hierarchical cascade architecture of the visual system. Furthermore, the modeling reveals that regions exhibiting high speech selectivity predominantly occupy the low-frequency portions of non-primary tonotopic maps. The authors argue that this architecture reflects an efficient coding mechanism where the cortex magnifies specific spectral features to facilitate the transition from acoustic encoding to flexible speech representation.

      Overall, the study presents concise analyses and compelling high-resolution results that advance our understanding of auditory cortical organization. However, the manuscript currently exhibits several significant theoretical and methodological gaps that temper its broader claims. Most notably, the authors' reliance on a spatial, retinotopic-like analogy overlooks the fundamentally temporal nature of audition. Decoding continuous, natural speech relies heavily on dynamic, full-spectrum temporal integration and contextual recurrent computations, which are difficult to reconcile with the purely static, low-frequency spatial tuning observed here.

      Strengths:

      (1) The utilization of ultra-high-field 7T functional imaging combined with large-scale, naturalistic continuous stimuli provides an excellent signal-to-noise ratio and captures cortical responses under ecologically valid conditions.

      (2) The application of a non-linear pRF encoding model provides a robust, quantitative method for parameterizing and mapping tonotopic features across the cortex, moving beyond simple contrast-based parcellations.

      (3) The manuscript effectively demonstrates the relationship between category selectivity (e.g., speech) and underlying tonotopy, drawing an elegant and structurally useful analogy to the well-established relationship between category selectivity and retinotopy in the visual cortex.

      Weaknesses:

      (1) While the PCA mapping of the functional and structural parameter space is visually compelling, the robustness of this representational geometry across varying acoustic contexts remains ambiguous. Because the model relies on the specific statistical regularities of a single naturalistic audiovisual stimulus set, it is unclear if this low-dimensional structure would hold when tested against isolated speech sounds, environmental noise, or spectrally matched non-speech control stimuli.

      (2) The methodological descriptions currently lack the computational precision required for replication and deep evaluation. I would suggest that the exact mathematical formulation of the encoding model be fully specified in the Methods section. This should include an explicit definition of the objective function, a clear accounting of all terms and hyperparameters utilized during the fitting process, and the exact dimensionalities of both the input feature space and the resulting parameter space.

      (3) There is a critical theoretical disconnect between the observed static, low-frequency tuning in the STG and the known acoustic requirements for continuous speech perception. Speech is a full-spectrum signal; while fundamental frequencies and formants dominate the lower spectrum (which is vital for processing dynamic pitch contours), high-frequency bands (>1 kHz) carry indispensable phonetic information, such as the rapid spectrotemporal dynamics of consonants, especially fricatives. If the speech-responsive cortex is primarily and statically tuned to a low-frequency spectrum, it is unclear how the dynamic, high-frequency spectral information required for semantic decoding is represented. A rich body of electrophysiological literature documents diverse spectrogram coding in the STG. For example, Mesgarani et al. (Science, 2014) demonstrated using spectrotemporal receptive field models that neural populations in the STG are tuned to both low and high-frequency spectrograms well above 1 kHz. The authors must address this discrepancy and attempt to reconcile their static tonotopic findings with the existing literature on dynamic speech encoding.

      (4) While drawing parallels between visual and auditory processing hierarchies is conceptually attractive, the modalities face fundamentally different computational challenges. Vision is largely resolved in space, making a retinotopic spatial coding strategy ecologically and computationally sound. Audition, however, evolves continuously in time. Complex temporal structure, continuous temporal integration, and contextual recurrent computations are paramount for auditory processing, particularly for speech comprehension. In this sense, a purely spatial or tonotopic coding framework is insufficient to fully explain the complex temporal processing dynamics required in the higher-order auditory domain.
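
      As an illustration of the level of specification requested in point (2), a generic compressive pRF for a single spectral frame could be written as below. This is a commonly used form (a Gaussian over log frequency followed by a static power-law nonlinearity) and is not necessarily the authors' model. All names and parameters are hypothetical, and hemodynamic convolution and the fitting objective are deliberately omitted.

```python
import math


def prf_response(spectrum, freqs_hz, center_hz, bandwidth_oct, exponent, gain=1.0):
    """Predicted response of one voxel to one spectral frame.

    spectrum:      non-negative energy per frequency bin.
    freqs_hz:      center frequency of each bin, in Hz.
    center_hz:     preferred frequency of the pRF.
    bandwidth_oct: Gaussian standard deviation, in octaves.
    exponent:      static nonlinearity; values below 1 are compressive.
    gain:          overall response amplitude.
    """
    drive = 0.0
    for energy, f in zip(spectrum, freqs_hz):
        # Distance from the preferred frequency on a log2 (octave) axis.
        distance_oct = math.log2(f / center_hz)
        weight = math.exp(-0.5 * (distance_oct / bandwidth_oct) ** 2)
        drive += weight * energy
    return gain * drive ** exponent


freqs = [250.0, 500.0, 1000.0, 2000.0]
frame = [1.0, 2.0, 1.0, 0.5]
louder = [2.0 * e for e in frame]

linear_ratio = (prf_response(louder, freqs, 500.0, 1.0, 1.0)
                / prf_response(frame, freqs, 500.0, 1.0, 1.0))
compressive_ratio = (prf_response(louder, freqs, 500.0, 1.0, 0.5)
                     / prf_response(frame, freqs, 500.0, 1.0, 0.5))
```

      With an exponent of 1, doubling the stimulus energy doubles the predicted response. With an exponent of 0.5, the response grows only by a factor of the square root of 2, which is the sense in which lower exponents correspond to greater compressivity. A full Methods description would add the objective function, the hyperparameters, and the input and parameter dimensionalities to a definition of this kind.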

    4. Reviewer #3 (Public review):

      Summary:

      The work has the potential to identify the topographical organization of the auditory cortex, which remains controversial when studied with non-naturalistic sound stimulation. It does so using population receptive field mapping, an elegant approach developed in the visual domain to study the organization of the visual system under naturalistic stimulation conditions.

      Strengths:

      This work presents a topographic analysis of auditory cortical organization, using a substantial Human Connectome Project 7-Tesla functional imaging dataset in which 174 participants viewed naturalistic movies.

      Weaknesses:

      The key issue for the paper is that even the authors seem undecided on what the topographical results are and whether these results are consistent with, refute, or expand our notion of human auditory cortical field organization using this massive dataset obtained under movie-watching conditions. Without this clarity, and with much of the discussion of the issues surrounding topographic mapping buried in the Supplementary materials section, it is not clear what the authors think the advance of the current work is beyond the large dataset.

      On the flip side, there is little consideration of the challenges of mapping the auditory cortex with naturalistic stimuli, which prevent visual and auditory stimulation conditions from being dissociated and may contribute to the lack of clarity in the tonotopic mapping.

      As such, the current manuscript struggles to achieve its full potential.

    1. eLife Assessment

      This important study identifies two pairs of dopaminergic neurons (DA-WED) in Drosophila that coordinate cardiac deceleration and locomotor responses to a mechanical threat. The evidence is convincing, supported by comprehensive optogenetic, physiological, and behavioral experiments showing that these neurons are required for and sufficient to drive threat-associated cardiac slowing. The proposed role of cardiac deceleration as an interoceptive contributor to locomotion is intriguing, but should be presented more cautiously, as the causal relationship between heartbeat changes and locomotor output remains less directly established. The work will be of broad interest to those interested in neural circuits, neuromodulation, and the integration of physiological and behavioral responses.

    2. Reviewer #1 (Public review):

      Summary:

      This study by Tsuji et al. explores a mechanical threat model in Drosophila using air puffs as a stimulus. The authors first establish the paradigm and show that air puffs induce cardiac deceleration along with increased locomotion. They then identify dopamine as a key regulator of this response and go on to map the underlying circuit. In doing so, they pinpoint two pairs of DA-WED neurons as critical players. They carefully used intersectional strategies to achieve relatively clean labeling of these neurons, which helps ensure that the observed effects can be attributed specifically to DA-WED neurons. They further show that DA-WED neurons are both required and sufficient to drive cardiac deceleration, and that their activity increases in response to air puff stimulation. These neurons also contribute to the locomotor response. Directly inducing cardiac deceleration via optogenetic manipulation of cardiomyocytes also increases locomotion, suggesting a link between cardiac state and behavioral output.

      Strengths:

      Overall, the experiments are thoughtfully designed, well-controlled, and clearly presented. The figures are easy to follow, and the conclusions are generally well supported by the data. The manuscript is also clearly written, with a discussion that acknowledges potential caveats and outlines future directions. The genetic tools, behavioral paradigm, heart rate measurement approaches, and stimulation methods introduced here will be valuable resources for the community.

      Weaknesses:

      A few minor points to add to the clarity of the manuscript:

      (1) The DA-WED driver (R48A08-AD ∩ VT008692-DBD ∩ TH-FLP) appears quite clean in the brain. However, since the study focuses on cardiac function and locomotion, it would be helpful to check expression in cardiomyocytes and the ventral nerve cord. This would help rule out any off-target expression that might contribute to the phenotypes and further support the idea of a descending pathway from brain dopaminergic neurons.

      (2) DA-WED>Kir2.1 abolishes the puff-induced locomotor response (Figure 5b), suggesting that DA-WED neurons are directly involved in mediating locomotion. In the model (Figure 5L), it might therefore make more sense for the pathway from mechanical threat to locomotion to pass through DA-WED neurons. The authors could consider adjusting the schematic if they agree.

      (3) In line 408, Figure 5K should be 5L as it's a discussion of the model.

      (4) In Figure 5j, the x-axis is missing time labels. Even if it matches Figure 5h, adding labels would make it easier to interpret at a glance.

      (5) In line 312, it would be helpful to briefly explain why a 28 ms light pulse was used, compared to other pulse durations elsewhere in the paper.

      (6) The cardiac deceleration seems to recover quickly after the air puff ends, whereas the locomotor response persists longer (around 10-15 seconds; see Figure 1 and Figure 5). This difference might suggest that DA-WED neurons influence locomotion through an additional or partially independent pathway, beyond their role in cardiac regulation. It could be worth briefly discussing this possibility.

    3. Reviewer #2 (Public review):

      Summary:

      The authors study cardiac deceleration during threat responses in Drosophila. In particular, they focus on identifying the neuronal control of this deceleration. Using behavioral and cardiac tracking and analysis, genetics, and calcium imaging, they identify two pairs of dopaminergic neurons involved in cardiac deceleration during air puff responses.

      Strengths:

      The study is overall well done, and the paper is clearly written. In particular, the work on identifying the two pairs of dopaminergic neurons involved in cardiac deceleration, using a series of drivers and generating new ones, is rigorous and extensive. Finally, the authors manipulate the heartbeat to investigate how it influences threat responses.

      Weaknesses:

      There are, however, several points that need to be clarified, as some claims are not entirely supported by evidence.

      The authors, for example, claim that dopaminergic neurons are responsible for cardiac deceleration (during the air puff, lines 182-3, page 9). However, based on the work in this study, it seems that other neurons could be involved in this control as well. In addition to dopaminergic neurons, the authors test serotonergic and octopaminergic neurons, which, based on silencing experiments, are also implicated in heartbeat deceleration. Furthermore, because they find that dopaminergic neurons are the only ones that, upon thermogenetic activation, lead to a lower heartbeat frequency, they conclude that the dopaminergic neurons are responsible for air-puff-induced cardiac deceleration.

      However, these activation experiments are done in a different context than the air puff experiments (at a higher temperature, which could have an effect on the heartbeat changes upon activation of different neuron groups), and because silencing of other monoaminergic neuron types during the air puff also resulted in less cardiac deceleration, one cannot exclude the implication of octopaminergic or serotonergic neurons in air-puff-induced deceleration.

      Activation experiments without high temperatures (using, for example, optogenetics) and/or in the presence of the air puff would be important to determine that the dopaminergic neurons are the main type of monoaminergic neurons involved in air-puff-induced cardiac deceleration. Otherwise, the related claims should be rephrased in a way that clearly doesn't exclude a possible implication of other monoaminergic neurons.

      Regarding the interactions between cardiac deceleration and locomotion, the authors propose, based on the results, that optogenetic cardiac deceleration is sufficient to induce an increase in locomotion, and that it is the decrease in heartbeat that would, via interoceptive pathways, trigger an increase in locomotion. In the model they propose, the DA-WED neurons would induce a decrease in heartbeat that, in turn, would trigger an increase in locomotion. There is not enough evidence that cardiac deceleration is what triggers an increase in locomotion during air puff responses. As the authors themselves state, the experiments that would demonstrate this would involve preventing cardiac deceleration while optogenetically activating DA-WED. It therefore cannot be excluded that the DA-WED neurons trigger an increase in locomotion that is possibly modulated by the cardiac activity. Both alternatives should be considered (models in Figures 4 and 5).

    4. Reviewer #3 (Public review):

      Summary:

      In this elegant study, Tsuji et al. identify a relationship in Drosophila between cardiodynamics and threatening stimuli, in which mild air puffs elicit a brief bradycardia that coincides with locomotion increases. They then take advantage of the arsenal of genetic tools available in the fruit fly to reveal the indispensability of dopamine, through the action of Dop1R2, in this phenomenon. Further, they pinpoint the source of this dopamine to two specific pairs of threat-activated neurons, the DA-WED neurons. They then test and find a potential role for cardiac interoception from the heart in linking behavior and cardiodynamics.

      Strengths:

      This is an interesting and timely story that brings together the tools of fruit fly systems neuroscience and links it with physiology. The experiments are well done and tell a very nice story. In particular, the primary message of the paper - that the authors have identified specific dopaminergic neurons that regulate cardiac activity - is sound.

      Weaknesses:

      There are no important problems with the scientific approach. Rather, there are some interpretive changes I would consider.

      (1) The changes in heart rate are small (10% or so), and, as far as I can tell, are evident for a beat or two. So the data may be better interpreted not as a change in rate but as a lengthening of diastole for a beat or two. That may seem a petty difference, but it might point to particular stretch-activated systems or changes in blood flow as the determinant.

      Heart rate must be averaged over time, and so might be blurring the effects. It may be useful to produce figures centered on beat count and duration rather than time. Because the effect may even be just on a single beat, we suggest the authors try plotting the average beat duration for each beat that follows the air puff. If it's really just the first beat, using a quantification of the change of this duration relative to the average that precedes the puff may produce more striking figures.
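
      The beat-centered quantification suggested above can be sketched as follows. This is a minimal illustration rather than the authors' analysis: the function name, the `n_pre`/`n_post` window sizes, and the input format (a sorted list of beat onset times in seconds) are assumptions made for the example.

```python
from statistics import mean


def beat_durations_relative_to_baseline(beat_times, puff_time, n_pre=5, n_post=5):
    """Return post-puff beat durations divided by the mean pre-puff duration.

    beat_times: sorted onset times (s) of successive heartbeats (hypothetical input).
    puff_time:  time (s) of air-puff onset.
    """
    # A beat's duration is the interval from its onset to the next onset.
    intervals = [(t0, t1 - t0) for t0, t1 in zip(beat_times, beat_times[1:])]
    # Baseline: the last n_pre beats that finished before the puff.
    pre = [d for start, d in intervals if start + d <= puff_time][-n_pre:]
    # Response: the beat in progress at the puff, then the beats that follow.
    post = [d for start, d in intervals if start + d > puff_time][:n_post]
    if not pre or not post:
        raise ValueError("need at least one beat interval before and after the puff")
    baseline = mean(pre)
    return [d / baseline for d in post]
```

      Values above 1 mark beats that are longer than the pre-puff average, so an effect confined to one or two beats appears as an isolated peak instead of being diluted by averaging the rate over time.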

      (2) The authors' model that cardiac deceleration leads to walking is only partially supported by their data. In the first figure, the relationship between cardiac deceleration and walking probability seems to be inverted relative to their model (weak stimulus -> strong cardiac effect and weak locomotor effect; strong stimulus -> weak cardiac effect and strong locomotor effect). It is possible that this discrepancy may disappear when the authors look at beat duration rather than heart rate (for instance, if following the strong stimulus, there is a very long beat that is followed by tachycardia, thus weakening their observed HR change). It would also be easier to relate the data in Figure 1 to their interoceptive model if some data were shown that illustrated the relative timing of the cardiac change and the locomotor start.

      (3) Also, since the locomotor and cardiac changes are probabilistic, it would be very useful to see how their respective probabilities change when conditioned on the other. According to their interoceptive model, locomotion should preferentially increase on trials where cardiac deceleration occurs. The authors should discuss this incongruity and also potential alternative interpretations of their cardiac manipulation experiments. Perhaps the bradycardia makes them more sensitive to threats - as suggested in the introduction? Control flies show a mild increase in locomotion following green light (Figure 5j), so perhaps by slowing the heart, they are more sensitive and thus respond more strongly to this stimulus?
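
      The conditional analysis requested in point (3) could look like the sketch below. It assumes each trial has already been scored as a pair of booleans (whether cardiac deceleration occurred, whether locomotion increased); that scoring step and all names here are hypothetical and not taken from the manuscript.

```python
def conditional_probabilities(trials):
    """Estimate P(walk | deceleration) and P(walk | no deceleration).

    trials: list of (decelerated, walked) boolean pairs, one per trial
            (hypothetical per-trial scoring).
    """
    with_decel = [walked for decel, walked in trials if decel]
    without_decel = [walked for decel, walked in trials if not decel]

    def proportion(flags):
        # Fraction of True values; NaN when there are no trials in the group.
        return sum(flags) / len(flags) if flags else float("nan")

    return {
        "p_walk_given_decel": proportion(with_decel),
        "p_walk_given_no_decel": proportion(without_decel),
    }
```

      Under the interoceptive model, the first probability would be expected to exceed the second; similar values would be easier to reconcile with the two responses being driven in parallel.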

      (4) Looking at the example shapes of the beats in Figure 5g versus Figure 1c, the optogenetically induced diastole has a very different shape from the naturally occurring long beat. Thus, the exact cardiac stimulus may be unnatural. If this is true across trials and animals, it may be worth considering that the funny beat (like an anxiogenic atrial fibrillation in mammals) is the source of the fear and, in turn, locomotor behavior (also interesting!) rather than being a true replication of the cardiac events seen following the puff stimulus.

    1. eLife Assessment

      This study proposes that fitness level influences exercise-induced hypoalgesia in women. However, the evidence to support this claim is incomplete: the conclusions rely on a small interaction that emerges only under specific conditions and are incongruent with the title, the findings are inconsistent across pain modalities and stimulus intensities, the analysis approach does not fully exploit the continuous pain ratings collected, and the absence of a baseline condition limits the interpretability of results as reflecting true hypoalgesia. Additionally, the methods by which fitness level was categorized across cohorts can be questioned, and the results and figures do not clearly illustrate how between-group comparisons were conducted. With a proper revision, it could be useful for sports medicine practitioners to consider how they administer exercise protocols to help those experiencing pain.

    2. Reviewer #1 (Public review):

      Summary:

      The current study is a follow-up to a previously published study by the same research group (Nold et al. 2025). In the previous study, the authors had included a set of exploratory analyses which assessed the effects of fitness level (denominated by a relative FTP), sex, and drug treatment (Naloxone versus placebo). In this previous study, the authors state that "exploratory analysis showed a significant main effect of fitness level on differences in pain ratings in the [saline] condition... suggesting increased hypoalgesia with increasing fitness levels, pooled across all stimulus intensities".

      In the current study, the authors have recruited an additional 22 female participants (21 included in analysis) from local cycling clubs to assess if fitness level does indeed impact exercise-induced hypoalgesia responses to experimental thermal and pressure pain models.

      Strengths:

      The current study has the potential to present a convincing argument about the effect of fitness level and potentially other factors (e.g., sex) on exercise-induced hypoalgesia responses. Combining data across two of their primary studies would be highly fruitful to the research community interested in this area. Specifically, it has the potential to inform sports medicine practitioners and how they administer exercise protocols to help those experiencing pain with a further consideration for the fitness level (and maybe sex) of their patients.

      Weaknesses:

      However, the current study makes several bold claims about the role of fitness level and sex on exercise-induced hypoalgesia, which I do not believe that this study on its own - or in conjunction with the previously published study by the same authors - can make at present. Namely, the current study does not appear to conduct any specific analyses between the cohorts from either study (current and previous). The results mention a difference in the group mean values in "fitness level" between cohorts, but the analysis itself on pain responses/exercise-induced hypoalgesia is limited only to the cohort from the current study. If the authors wanted to provide a convincing argument that fitness level has an effect on exercise-induced hypoalgesia, then the analysis of this study would have to include an analysis between the groups considered to be of "high" and "low" fitness level. I do not think the current study does this. Instead, it makes an assumption from the previous study (Nold et al. 2025) which only states that "exploratory analysis showed a significant main effect of fitness level on differences in pain ratings in the [saline] condition... suggesting increased hypoalgesia with increasing fitness levels, pooled across all stimulus intensities". The analysis of this study would have to include fitness level "high fitness" versus "low fitness" of participants across both studies in its statistical model to properly discern if fitness level has an impact on exercise-induced hypoalgesia.

      A similar comment can be made with respect to sex differences, as these have not been assessed in the analysis of this study either.

      Another area of weakness in this study is how "fitness level" has been demarcated across participants. One issue is how authors have assumed that the current cohort is 'fit', whereas the previous cohort was 'less fit', meaning that the authors could be coming to false conclusions about fitness level. In detail, figures within the current study show a large overlap between the 'fit' and 'less fit' cohorts, where some participants have a higher relative functional threshold power (FTP) in the 'less fit' cohort than the 'fit' cohort and vice versa. Therefore, I believe the authors should better demarcate between those that are in the 'more fit' and 'less fit' groups according to a validated and well-established criterion from the kinesiology and sport science literature. That being said, I think this may be problematic in some ways as FTP is considered a relatively poor measure to denote fitness levels, a limitation highlighted in the previous study's review.

      Altogether, whilst I commend the researchers on their body of work across the two studies, the current methods and analysis provide an incomplete assessment of their primary research question, and therefore, I would urge the authors to reconsider some of their methods/analysis and the framing of their results to better reflect the main research question they have attempted to answer. Likewise, I would recommend that readers ensure they consider the current results with caution until the authors have addressed some areas of concern which currently limit their main conclusions.

    3. Reviewer #2 (Public review):

      This study addresses an important question regarding exercise-induced modulation of pain in women, but the conclusions appear to be based on relatively limited and selective evidence. The authors report an interaction between exercise intensity and stimulus intensity, which they interpret as evidence for exercise-induced hypoalgesia and conclude that fitness, but not sex, modulates this effect. However, this main result relies on a relatively small interaction that emerges only under specific conditions, with inconsistent findings across pain modalities and stimulus intensities, and an analysis approach that does not fully exploit the continuous pain ratings collected. The lack of a baseline condition further limits the interpretability of the findings as reflecting hypoalgesia, and overall, the data provide a rather constrained basis for drawing broader conclusions.

      Strengths:

      (1) The focus on women is important and timely, particularly given the ambiguity in prior findings and the historical bias toward male-dominated samples.

      (2) The attempt to revisit previous findings in a new cohort is valuable in principle.

      Weaknesses:

      (1) The core interpretation may not be fully supported by the data

      The central claim, that the results demonstrate exercise-induced hypoalgesia and its dependence on fitness but not sex, does not appear to be fully supported by the evidence presented.

      1.1 Lack of baseline condition

      The absence of a no-exercise baseline substantially limits interpretation. The study compares high- and low-intensity exercise, but without a baseline, it is not possible to determine whether either condition produces hypoalgesia or hyperalgesia relative to calibration. The observed HI-LI difference, therefore, reflects only a relative contrast between exercise intensities, not an absolute reduction in pain. As a result, attributing the findings to "hypoalgesia" may be difficult to justify fully.

      1.2 Lack of internal replication across conditions

      The reported effect is highly specific and does not clearly generalise across the experimental design. It emerges significantly only for heat pain at the highest stimulus intensity, with no clear effects for other intensities or for pressure pain. Moreover, the main statistical result is a relatively small interaction effect with a modest p value, which translates into a difference of approximately 6-8 VAS units on a 0-150 scale. This combination of a small effect size, limited statistical strength, and restriction to a single condition substantially weakens the evidence for a robust or generalisable effect.

      1.3 Deviations from the original study and selective use of data

      Although framed as a follow-up to previous work, the current study introduces substantial methodological changes, particularly in the acquisition and scaling of pain ratings (continuous vs post-hoc ratings, modified VAS with sub-threshold range). Despite collecting rich continuous data, the analysis focuses on peak responses to approximate the previous study. While this may aid comparability, it results in a strong emphasis on a single data point (highest intensity), rather than leveraging the full dataset. This limits both interpretability and comparability.

      1.4 Over-reliance on null results regarding sex differences

      The conclusion that fitness, but not sex, modulates exercise-induced pain may not be directly supported by the data presented. The current study includes only highly fit women, and comparisons with men or less-fit women rely on non-significant differences in a previous cohort. The absence of a significant difference does not provide evidence for equivalence, and no formal statistical support for a null effect is provided. As such, conclusions about the absence of sex differences would benefit from more cautious interpretation.

      (2) Limited sample and lack of diversity

      The dataset is narrow in scope, comprising a small sample (N = 21) of healthy, highly fit women. Key demographic characteristics (e.g. age range, BMI distribution) are not fully presented, explored or discussed. This limits generalisability and makes it difficult to draw broader conclusions about exercise-induced pain modulation in women, which is the main focus of the study.

      (3) Methodological choices limit the interpretability of the data

      Several methodological decisions would benefit from stronger justification:

      3.1 The use of a non-standard VAS scale (0-150 with a fixed pain threshold at 50) is unconventional and may influence how participants report pain, while limiting comparability with related literature.

      3.2 Participants explicitly reported expecting exercise to reduce pain, introducing a potential confound that is not presently addressed.

      3.3 A more comprehensive use of the full time series of pain ratings would provide a stronger and more transparent basis for interpretation of the present findings.

    1. eLife Assessment

      This study presents important findings on the relationship between nutrient availability and NAD/NADH levels, which in turn regulate biomass production in cancer cells. The authors provide convincing evidence to support their claims, offering insight into why it is difficult to predict which nutrients limit cancer cell growth: both cell type and nutrient availability together determine the oxidative capacity that constrains the synthesis of various metabolic intermediates. The manuscript will be of broad interest to researchers working in cancer and cell metabolism.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript investigates how cellular NAD/NADH ratios are controlled in cancer cell lines in vitro. The authors build on previous work, which shows that serine synthesis is sensitive to NAD/NADH ratios and PHGDH expression. Here, the authors demonstrate that serine synthesis is variable across a panel of cell lines, even when controlling for expression of serine synthesis enzymes such as PHGDH. The authors show that cellular NAD/NADH ratios correlate with the ability to synthesize serine and grow in serine-deprived environments when PHGDH levels remain constant. Investigating this variability in NAD/NADH ratios, the authors find that the cells that can positively respond to serine deprivation are able to increase oxygen consumption and cellular NAD/NADH ratios. Cells that do not increase oxygen consumption in response to serine deprivation do not increase NAD/NADH ratios and cannot grow well without serine. The authors go on to show that in cells with the ability to increase oxygen consumption upon serine deprivation, PHGDH expression alone is sufficient to fully restore growth-serine; in cells that cannot increase oxygen consumption, both PHGDH expression and interventions to increase NAD/NADH ratios are required to increase growth. Thus, cells need both PHGDH and NAD/NADH increases to maximize serine synthesis in response to serine deprivation. The authors previously showed that lipid synthesis likewise requires NAD regeneration. Interestingly, one cell line that does not increase oxygen consumption in response to serine limitation tends to increase oxygen consumption in response to lipid deprivation; accordingly, depriving this cell line of lipids increases the synthesis of serine. 
      Together, these findings show that how cells respond to nutrient deprivation is highly variable and that the response to nutrient deprivation (for example, whether or not oxygen consumption is increased) will determine how well cells tolerate depletion of nutrients with related biosynthetic constraints. This work sheds light on the complexity of cancer cell metabolism and helps to explain why it is difficult to predict which nutrients will be limiting to any cancer cell type or environment.

      Strengths:

      (1) The authors use multiple interventions to manipulate NAD/NADH ratios in cells.

      (2) Experiments are well controlled and appropriately interpreted.

      Comments on revised version:

      The authors thoughtfully and thoroughly responded to all reviewer comments. The revised manuscript addresses the critiques.

    3. Reviewer #2 (Public review):

      In the manuscript "Cancer cells differentially modulate mitochondrial respiration to alter redox state and enable biomass synthesis in nutrient-limited environments", Chang et al investigate how cancer cells respond to the limitation of certain environmental nutrients by regulating the cellular NAD+/NADH ratio. They focus on serine and lipid metabolism, pathways known to be controlled by the NAD+/NADH ratio, and propose that changes in mitochondrial respiration in response to deprivation of these nutrients can influence the NAD+/NADH ratio, thereby impacting biomass synthesis.

      While the study is descriptive in nature and does not investigate specific molecular mechanisms that explain the crosstalk between nutrient availability and mitochondrial redox changes, the experimental component is robust, and the conclusions are well supported by the results. Some suggestions could further refine the conclusions and enhance the quality of the manuscript.

      Comments on revised version:

      The authors have provided a very comprehensive response. Their updated paper has improved, and the critiques have been mitigated.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study presents valuable findings on the relationship between nutrient availability and NAD/NADH levels, which in turn regulate biomass production in cancer cells. The authors provide solid evidence to support their claims, offering insight into why it is difficult to predict which nutrients limit cancer cell growth: both cell type and nutrient availability together determine the oxidative capacity that constrains the synthesis of various metabolic intermediates. The manuscript will be of interest to researchers working in cancer and cell metabolism.

      We thank the eLife Editor for evaluating our manuscript and for the positive comments.

      Reviewer #1 (Public review):

      Summary:

      This manuscript investigates how cellular NAD/NADH ratios are controlled in cancer cell lines in vitro. The authors build on previous work, which shows that serine synthesis is sensitive to NAD/NADH ratios and PHGDH expression. Here, the authors demonstrate that serine synthesis is variable across a panel of cell lines, even when controlling for expression of serine synthesis enzymes such as PHGDH. The authors show that cellular NAD/NADH ratios correlate with the ability to synthesize serine and grow in serine-deprived environments when PHGDH levels remain constant. Investigating this variability in NAD/NADH ratios, the authors find that the cells that can positively respond to serine deprivation are able to increase oxygen consumption and cellular NAD/NADH ratios. Cells that do not increase oxygen consumption in response to serine deprivation do not increase NAD/NADH ratios and cannot grow well without serine. The authors go on to show that in cells with the ability to increase oxygen consumption upon serine deprivation, PHGDH expression alone is sufficient to fully restore growth-serine; in cells that cannot increase oxygen consumption, both PHGDH expression and interventions to increase NAD/NADH ratios are required to increase growth. Thus, cells need both PHGDH and NAD/NADH increases to maximize serine synthesis in response to serine deprivation. The authors previously showed that lipid synthesis likewise requires NAD regeneration. Interestingly, one cell line that does not increase oxygen consumption in response to serine limitation tends to increase oxygen consumption in response to lipid deprivation; accordingly, depriving this cell line of lipids increases the synthesis of serine. 
      Together, these findings show that how cells respond to nutrient deprivation is highly variable and that the response to nutrient deprivation (for example, whether or not oxygen consumption is increased) will determine how well cells tolerate depletion of nutrients with related biosynthetic constraints. This work sheds light on the complexity of cancer cell metabolism and helps to explain why it is difficult to predict which nutrients will be limiting to any cancer cell type or environment.

      Strengths:

      (1) The authors use multiple interventions to manipulate NAD/NADH ratios in cells.

      (2) Experiments are well controlled and appropriately interpreted.

      Weaknesses:

      Overall the data support the conclusions of the manuscript. I have only two minor comments and suggestions:

      We thank Reviewer 1 for their insightful comments, which have helped us improve the manuscript.

      (1) Figure 2B/C: data are presented as relative to +serine, which shows how some cells respond to -serine, but may also be of interest to see how absolute (not relative) NAD/NADH levels correlate with serine synthesis and serine-independent proliferation. In other words, is it the dynamic increase in the ratio that is most important, or the absolute level of the ratio?

      We thank Reviewer 1 for raising this point about whether it is the absolute NAD+/NADH ratio, or the change in NAD+/NADH ratio, that is important for increasing serine synthesis and allowing proliferation under serine depleted conditions. We reported relative ratios for accessibility to a general audience, but agree that this information is informative and should be presented. We assessed the NAD+/NADH ratio using an enzymatic assay, which does not directly measure absolute concentrations of NAD+ or NADH (PMID: 26232225). However, we previously confirmed the assay is in the same linear range for both NAD+ and NADH, and thus is valid for assessing the NAD+/NADH ratio. We now provide the unnormalized NAD+/NADH ratio data in Supplementary Figure 2G of the revised manuscript. This shows that the considered cells exhibit a range of NAD+/NADH ratios, and redox responsive cells do not cluster in having a higher or lower NAD+/NADH ratio.

      To more formally answer Reviewer 1’s question about whether the absolute ratio or change in ratio is important for increasing serine synthesis, we measured the correlation coefficient between the unnormalized NAD+/NADH ratios and the proliferation rate of all examined cancer cells cultured with or without serine. These data are presented in Author response image 1. Of note, we find that there is a significant positive correlation between the raw values of the measured NAD+/NADH ratio and proliferation rate in both serine-replete (r = .371) and serine depleted (r = .562) conditions. However, this correlation is not strong, and when examining the cancer cells whose proliferation in serine depleted conditions cannot be fully explained by serine synthesis enzyme expression (Calu6, 8988T, A549, MIA PaCa-2, H1299, and HCT116), there is no significant correlation between the raw NAD+/NADH ratio and proliferation rate in serine depleted conditions. The association between the relative change in the NAD+/NADH ratio and proliferation rate is much stronger upon serine deprivation (r = .571), as presented in Figure 2C of the revised manuscript. This suggests that the dynamic increase in the ratio is more tightly linked to the change in serine synthesis rate and proliferation in serine depleted environments, and we discuss this point in the revised manuscript with the following text:

      “Of note, whether the NAD+/NADH ratio of a cell was more or less oxidized in serine-replete conditions was not predictive of response to serine withdrawal (Supplementary Figure 2G).” (Lines 163-165)

      Author response image 1.

      Correlations between unnormalized NAD+/NADH ratios and cell proliferation rates between (A) all cancer cells examined (Calu6, MCF7, MDA-MB-231, A549, 8988T, MIA PaCa-2, A375, H1299, HCT116, MDA-MB-231 with PHGDH overexpression) in serine-replete conditions, (B) all cancer cells examined in serine depleted conditions, and (C) select cancer cells (labeled in gray) where serine synthesis enzyme protein expression does not fully explain proliferation in serine depleted conditions. Pearson correlation coefficient and P values were calculated by simple linear regression, *p<0.05, **p<0.01. Data shown are means of three biological replicates ± SD.
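
      For readers who want to reproduce this kind of analysis, the Pearson coefficient reported from a simple linear regression can be computed as in the sketch below. It is an illustration only: the function name is hypothetical, no study data are included, and the P value (which requires the t distribution) is deliberately left out.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

      For a simple linear regression with a single predictor, the square of this value equals the regression's coefficient of determination, so the two are interchangeable as measures of association strength.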

      (2) Line 177-178: the authors write, "We hypothesized that the elevated NAD+/NADH ratio represented a cellular response to make the NAD+/NADH ratio more oxidized to enable serine synthesis". I recommend modest edits to avoid anthropomorphizing. It is possible that the ratio responds for reasons yet to be determined and not necessarily because the cell is deliberately trying to enable serine synthesis.

      We thank Reviewer 1 for raising this point. We agree that our data do not show whether the ratio is elevated for the deliberate purpose of enabling serine synthesis and have edited the text accordingly with the following edit to that line of the revised manuscript:

      “We hypothesized that a more oxidized NAD+/NADH ratio could support greater serine synthesis and thus sought to identify the processes that increase the NAD+/NADH ratio in some but not all cancer cells.” (Lines 190-192)

      Reviewer #2 (Public review):

      In the manuscript "Cancer cells differentially modulate mitochondrial respiration to alter redox state and enable biomass synthesis in nutrient-limited environments", Chang et al investigate how cancer cells respond to the limitation of certain environmental nutrients by regulating the cellular NAD+/NADH ratio. They focus on serine and lipid metabolism, pathways known to be controlled by the NAD+/NADH ratio, and propose that changes in mitochondrial respiration in response to deprivation of these nutrients can influence the NAD+/NADH ratio, thereby impacting biomass synthesis.

      While the study is descriptive in nature and does not investigate specific molecular mechanisms that explain the crosstalk between nutrient availability and mitochondrial redox changes, the experimental component is robust, and the conclusions are well supported by the results. Some suggestions could further refine the conclusions and enhance the quality of the manuscript.

      We thank Reviewer 2 for their time and for their suggestions to improve the manuscript.

      Main critiques:

      (1) Throughout the manuscript, the authors utilise the number of cell doublings per day as an endpoint readout of cell proliferation. It would be advisable to include a quantification of the cell number and to display the proliferation rate over time. This would provide valuable insights into the timeline of cellular responses and avoid potential confounding effects associated with the use of Sulforhodamine B dye, an indirect measure of cell proliferation based on protein content, which may be influenced by some of the interventions. Furthermore, it will help determine whether specific treatments reduce cellular doublings resulting from cell death. This concern is particularly evident in treatments with rotenone, e.g., Fig. 1G, where the increase in doublings could be attributed to cell death.

      We thank the reviewer for this suggestion and agree that assessment of cell count provides additional information beyond Sulforhodamine B dye as an indirect measure of proliferation. To address this, we directly measured cell number over time using Incucyte Live-Cell imaging analysis applied to A549 and H1299 cells cultured with or without serine for 72 hours. Consistent with results using sulforhodamine B, A549 cells doubled at a rate of 0.874 per day and H1299 cells doubled at a rate of 1.034 per day in serine-replete conditions. In serine depleted conditions, A549 cells doubled at a rate of 0.205 per day while H1299 cells doubled at a rate of 0.544 per day. We have added the cell number measurements over time as well as the corresponding calculated doublings per day in Supplementary Figure 2D and Supplementary Figure 2E of the revised manuscript.
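
      As a point of reference, doublings per day are commonly derived from two cell counts as the base-2 logarithm of the fold change divided by the elapsed time in days. The sketch below assumes that standard formula; the response does not state the exact calculation used, and the function and parameter names are hypothetical.

```python
import math


def doublings_per_day(initial_count, final_count, hours):
    """Population doublings per day between two cell counts.

    Assumes exponential growth over the interval (a common simplification).
    """
    if initial_count <= 0 or final_count <= 0 or hours <= 0:
        raise ValueError("counts and elapsed time must be positive")
    days = hours / 24
    # log2 of the fold change gives the number of doublings.
    return math.log2(final_count / initial_count) / days
```

      For example, an eightfold increase in cell number over 72 hours corresponds to three doublings in three days, or one doubling per day.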

      We also agree with Reviewer 2 that serine deprivation and rotenone treatment could potentially impact cell viability, which might confound phenotypes, including NAD+/NADH ratio measurements. To assess whether serine deprivation and rotenone treatment cause cell death, we measured cell viability using Sytox Green after exposing cells to these conditions for 72 hours. We find that there is indeed more cell death in cells cultured without serine at most concentrations of rotenone. However, cell death did not exceed 4% in any of the conditions tested, suggesting this is not a major contributor to the cell doubling phenotypes. These data are now presented in Supplementary Figure 1C of the revised manuscript. However, in light of Reviewer 2’s comments, along with a comment from Reviewer 3 about whether rotenone induces ROS and cellular stress responses, we have decided to remove the proliferation data involving rotenone that were in Figure 1F and 1G of the original manuscript. The rationale is that the potential confounding impacts of rotenone on viability make interpreting the proliferation data difficult. Instead, we have focused Figure 1 of the revised manuscript on the observation that there is specifically a correlation between the cell NAD+/NADH ratio and serine synthesis.

      (2) The authors propose a model in which the deprivation of extracellular nutrients impacts mitochondrial respiration, which in turn increases the NAD+/NADH ratio and ultimately affects metabolic biosynthetic pathways that occur in the cytosol, such as serine biosynthesis. The mechanism by which nutrient availability is sensed and transmitted across different cellular compartments to regulate mitochondrial redox status remains unclear. This concern is particularly relevant for serine metabolism, as its synthesis occurs in the cytosol, but the authors connect it to mitochondrial respiration. Compartment-specific measurements of NAD+/NADH ratio would help to understand to what extent the redox state is affected by nutrients in the mitochondria and in the cytoplasm (see also minor critiques point 2). Moreover, the use of the genetic tool LbNox could be employed to manipulate the NAD+/NADH ratio in a compartment-specific manner, while also avoiding the toxicity of certain compounds, such as rotenone. This set of experiments would add depth to the investigation, which might otherwise appear too descriptive.

      (A) Compartment-specific measurements of NAD+/NADH ratio would help to understand to what extent the redox state is affected by nutrients in the mitochondria and in the cytoplasm

      The question of how nutrient availability is sensed and transmitted across cellular compartments to impact mitochondrial respiration is important. However, rigorous assessment of compartment-specific metabolism is quite challenging, as we are not aware of tools to accurately measure redox ratios in a compartment-specific manner. Direct assessment of cofactor levels in subcellular compartments requires long isolation times and is unlikely to be accurate (PMID: 27565352). Rapid immunopurification of mitochondria has been used to estimate metabolite levels and ratios, but accurate measurements are hindered by rapid oxidation of NADH to NAD+. The use of fluorescence lifetime imaging (FLIM) to monitor NADH levels does not allow for accurate monitoring of the NAD+/NADH ratio as NAD+ cannot be visualized and NADH cannot be distinguished from NADPH. Additionally, the resolution of FLIM to interrogate compartment-specific signals is limited (PMID: 38594590). Fluorescent sensors, such as SoNar, have been used to image the NAD+/NADH ratio in compartments, though SoNar is sensitive to pH changes, which vary across compartments, and it has been argued that these sensors are more suitable for qualitative, not quantitative, changes in the NAD+/NADH ratio (PMIDs: 25955212, 29181426). It has also been argued that sensors are not amenable to measurement of mitochondrial ratios, as the predicted ratios are too reduced for the range of the sensors. Given these technical limitations, we opted to attempt a rapid subcellular fractionation (~25 second process to separate cytoplasm and mitochondria) followed by enzyme-based measurements of the NAD+/NADH ratio (PMID: 36883551), acknowledging the limitations of this approach. We find that across both A549 and H1299 cells, the mitochondrial NAD+/NADH ratio is lower than the cytosolic NAD+/NADH ratio, as expected.
Using this approach, we find that in A549 cells, serine depletion leads to a decreased cytosolic NAD+/NADH ratio compared to serine-replete conditions while having no impact on the mitochondrial NAD+/NADH ratio. On the other hand, serine depletion leads to an elevated cytosolic NAD+/NADH ratio in H1299 cells while also having no impact on the mitochondrial NAD+/NADH ratio. In parallel, we used extracellular pyruvate exposure as a positive control, which should support cytosolic NAD+ regeneration, and rotenone as a negative control, which should suppress mitochondrial NAD+ regeneration. We show that pyruvate led to an elevated cytosolic NAD+/NADH ratio whereas rotenone treatment led to a decreased cytosolic NAD+/NADH ratio. Despite rotenone inhibiting complex I of the electron transport chain, we did not observe a change in the mitochondrial NAD+/NADH ratio (Author response image 2). This likely indicates that this assay is not sensitive enough to detect changes in mitochondrial NAD+/NADH, and we opted not to include these data in the revised manuscript given the limitations of the approach.

      Author response image 2.

      Rapid subcellular fractionation to examine compartment-specific NAD+/NADH ratios. (A) Cytosolic and mitochondrial NAD+/NADH ratios of A549 cells grown with or without serine for 24 hours, n=3. (B) Cytosolic and mitochondrial NAD+/NADH ratios of H1299 cells grown with or without serine for 24 hours, n=3. (C) Cytosolic and mitochondrial NAD+/NADH ratios of H1299 cells treated with either 1 mM pyruvate or 50 nM rotenone for 24 hours, n=3. P-values were calculated using a Student’s t-test, *p<0.05, **p<0.01. Data shown are means ± SD.

      We nevertheless draw the following conclusions from these data:

      (1) Changes to mitochondrial NAD+/NADH either do not occur or are not captured with this approach. Even rotenone treatment, which inhibits complex I and might be expected to change mitochondrial redox state, does not change the measured mitochondrial NAD+/NADH ratio.

      (2) The whole cell NAD+/NADH ratio most likely reflects changes in the cytosolic NAD+/NADH ratio. While observing no impact on the mitochondrial NAD+/NADH ratio after rotenone treatment, we still find the cytosolic NAD+/NADH ratio is decreased. Moreover, both pyruvate and serine depletion led to an elevated cytosolic NAD+/NADH ratio in H1299 cells, which we observe at the whole cell level.

      (3) H1299 cells depleted of serine elevate the cytosolic NAD+/NADH ratio, while rotenone treatment decreased the cytosolic NAD+/NADH ratio despite changes in mitochondrial respiration. This suggests that redox shuttles, such as the malate aspartate shuttle, play a role in communicating changes in mitochondrial redox dynamics to the cytoplasm. We test this hypothesis as described in response to Reviewer 2, point B, below.

      (B) The mechanism by which nutrient availability is sensed and transmitted across different cellular compartments to regulate mitochondrial redox status remains unclear

      Multiple known shuttles are involved in exchanging redox equivalents between the mitochondria and the cytosol. It is likely that multiple shuttles are involved, or could be involved in the right context, but one major shuttle is the malate aspartate shuttle (MAS), and the MAS has been shown previously to support de novo serine synthesis (PMID: 37647199). Thus, we hypothesized that the MAS is involved in the response involving elevated mitochondrial respiration in H1299 cells to increase the whole cell NAD+/NADH ratio upon serine deprivation. To test this, we used CRISPR/Cas9 to generate H1299 cells lacking MAS components GOT1, MDH1, or GOT2 and measured the cell NAD+/NADH ratio. We did not knock out MDH2 given its integral role in the TCA cycle. We find that when MDH1 and GOT2 are knocked out, H1299 cells no longer exhibit elevated whole cell NAD+/NADH ratios upon serine deprivation. Consistently, removing MDH1 and GOT2 also blunted the increase in oxygen consumption as well as the increase in serine synthesis upon serine deprivation. This suggests that MDH1 and GOT2 activity through the MAS supports the process by which mitochondrial NAD+ regeneration is transmitted to the cytoplasm to support serine synthesis. We have added these data as Supplementary Figure 7 in the revised manuscript.

      (C) Moreover, the use of the genetic tool LbNox could be employed to manipulate the NAD+/NADH ratio in a compartment-specific manner

      We thank Reviewer 2 for the suggestion to consider whether LbNOX might be used to manipulate the NAD+/NADH ratio in a compartment-specific manner. We expressed LbNOX in both the cytoplasm and the mitochondria of A549 (serine non-responsive) cells. We predicted that if LbNOX expression, either in the cytoplasm or the mitochondria, affected the NAD+/NADH ratio, proliferation in serine depleted conditions might be improved. However, we found that expressing LbNOX in the cytoplasm or the mitochondria of A549 cells had no effect on the NAD+/NADH ratio. Thus, LbNOX expression in either compartment also did not change proliferation in serine depleted conditions. These data are consistent with the known limitations of this genetic tool. While LbNOX can increase NADH oxidation in response to some interventions like rotenone, it does not necessarily change the NAD+/NADH ratio of unperturbed cells. This was reported in the original description of LbNOX (PMID: 27124460). We confirmed that LbNOX was successfully expressed via immunoblotting, and also confirmed that LbNOX functioned by showing either cytoplasmic or mitochondrial LbNOX expression improves cell proliferation following complex I inhibition. Thus, expressing LbNOX in A549 cells is not informative for understanding compartment specific metabolism following serine deprivation. Nevertheless, as this question is likely to come up for other readers, we have included these data as Supplementary Figure 6 in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      Minor critiques:

      (1) It seems clear from the authors' data that the response to serine depletion in terms of cell proliferation is not determined exclusively by PHGDH levels. It would be useful to measure the levels of the other two enzymes in the serine synthesis pathway and also to measure serine uptake under normal conditions in the different groups of cells. This information could provide some insight into the different responses of cancer cell lines to serine deprivation.

      (A) It would be useful to measure the levels of the other two enzymes in the serine synthesis pathway

      Reviewer 2 raises a fair point, and we agree that measuring levels of other enzymes in the serine synthesis pathway is informative. Thus, we measured the expression of phosphoserine aminotransferase 1 (PSAT1) and phosphoserine phosphatase (PSPH) across all cancer cells examined and find that, similar to PHGDH protein expression, PSAT1 and PSPH protein expression is lower in many cancer cells that are more sensitive to serine withdrawal (e.g. MCF7). However, among the cancer cells where PHGDH protein expression did not explain the response to serine withdrawal, the protein expression of PSAT1 and PSPH also did not explain how well the cells proliferate without environmental serine. These data have been included in Supplementary Figure 2B of the revised manuscript.

      Of note, we measured serine synthesis enzyme expression for the six cancer cell lines whose proliferation in serine depleted conditions better correlated with a change in the NAD+/NADH ratio than it did with PHGDH expression: Calu6, 8988T, A549, MIA PaCa2, H1299, and HCT116. For these cells, we correlated proliferation upon serine depletion with PHGDH, PSAT1, and PSPH protein expression and found that interestingly, there was a significant negative correlation between PHGDH protein expression and proliferation upon serine deprivation. This was not observed for PSAT1 expression, and a statistically significant positive correlation between proliferation and PSPH protein expression was noted, though the variation in PSPH protein expression was large. We have added these correlation data to the revised manuscript as Supplementary Figure 2F.

      (B) It would be useful to measure…serine uptake under normal conditions in the different groups of cells

      Per the Reviewer’s request, we performed absolute quantification of serine uptake rates in serine-replete conditions for three serine “non-responder” cancer cells (Calu6, 8988T, A549) and three serine “responder” cancer cells (MIA PaCa-2, H1299, HCT116). We did not observe a notable relationship between serine uptake rate and whether cells responded to serine deprivation. Additionally, with the exception of 8988T cells having a higher serine uptake rate than the other cells, there was no statistical difference in serine uptake across the cancer cells tested (Author response image 3).

      Author response image 3.

      Basal serine uptake rate of exponentially growing cells in serine replete conditions. Serine levels were measured using GC-MS before and after 24 hours of serine depletion and normalized by area under the growth curve (PMID: 26954548). P-values were calculated using one-way ANOVA followed by a post-hoc Tukey HSD test, *p<0.05, **p<0.01.
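
      To illustrate the normalization named in the caption above, the following is a minimal Python sketch of a per-cell uptake rate computed as nutrient consumed divided by the area under the growth curve. It is not the analysis code used for the figure: the function names are hypothetical, the cell numbers and serine amounts in the usage lines are invented for illustration, and the closed-form area assumes purely exponential growth between the two time points.

```python
import math


def area_under_growth_curve(n_start, n_end, hours):
    """Integral of cell number over time (cell-hours).

    Assumption: cells grow exponentially from n_start to n_end over `hours`,
    so the integral of n_start * exp(mu * t) is (n_end - n_start) / mu.
    """
    if n_start <= 0 or n_end <= 0 or hours <= 0:
        raise ValueError("cell numbers and duration must be positive")
    if n_end == n_start:
        # No net growth: the curve is flat, so the area is a rectangle.
        return n_start * hours
    mu = math.log(n_end / n_start) / hours  # specific growth rate (1/hour)
    return (n_end - n_start) / mu


def uptake_rate(amount_start, amount_end, n_start, n_end, hours):
    """Nutrient consumed per cell per hour (amount units / cell / hour)."""
    consumed = amount_start - amount_end
    return consumed / area_under_growth_curve(n_start, n_end, hours)


# Hypothetical example: 100 nmol of serine consumed in 24 hours while the
# culture doubles from 1e5 to 2e5 cells.
rate = uptake_rate(400.0, 300.0, 1e5, 2e5, 24.0)
print(f"{rate:.3e} nmol per cell per hour")
```

      Dividing by the integrated cell number rather than the final cell count avoids overstating per-cell uptake in faster-growing lines, which matters when comparing cell lines with different doubling times.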

      (2) The authors experimentally demonstrated that some cancer cells respond to serine depletion with an increase in mitochondrial respiration, but the molecular mechanism behind this is not addressed. There is some evidence in the literature showing that serine acts as an activator of the glycolytic enzyme PKM, which is coherent with an increased mitochondrial respiration in the absence of serine (PMID: 23064226). The authors could discuss their findings in the context of this paper. Additionally, they could provide some insights about baseline mitochondrial activity in the different cell lines. Indeed, it seems that "redox responsive cells" might have an overall increased basal OCR.

      We appreciate the suggestion that pyruvate kinase M (PKM) may mediate the elevation in mitochondrial respiration in response to serine depletion. Given that serine is an allosteric activator of PKM, and PKM suppression can increase mitochondrial OCR, we discuss this possibility in the Discussion section of the revised manuscript using the following text:

      “Interestingly, serine is an allosteric activator of the glycolytic enzyme pyruvate kinase, which converts phosphoenolpyruvate to pyruvate and generates ATP (Chaneton, 2012). Thus, decreased environment serine availability in addition to differences in pyruvate kinase activity may yield lower glycolytic ATP, resulting in greater mitochondrial respiration in serine redox responder cancer cells.” (Lines 443-447)

      Additionally, we appreciate the reviewer’s observation that redox responsive cells may have an overall increased basal respiration rate. We directly measured mitochondrial dependent oxygen consumption in the same assay to test whether redox responsive cells exhibit higher mitochondrial respiration. We find that while the redox responsive H1299 and MIA-PaCa2 cells have higher mitochondrial respiration than non-responsive cells, HCT116 cells, which are also redox responsive to serine deprivation, did not exhibit higher mitochondrial respiration compared to redox non-responsive Calu6, 8988T, and A549 cells (Author response image 4). However, when comparing redox non-responders versus responders as a whole, there was a statistically significant difference in basal OCR. Together, this suggests that basal mitochondrial respiration rate in serine-replete conditions may be related in some cases to whether cancer cells elevate mitochondrial respiration and the NAD+/NADH ratio upon serine deprivation, but this cannot be the full explanation given the HCT116 cell data. We also acknowledge the reviewer’s statement that we do not understand the molecular mechanism by which respiration responds to serine deprivation and explicitly state this in the revised manuscript.

      Author response image 4.

      Basal oxygen consumption rate (OCR) of cancer cells in serine-replete conditions. (A) Kinetic OCR measurements of cancer cells before and after rotenone and antimycin injection, n=8. Data shown are means ± SD. (B) Quantified mitochondrial OCR (removing residual OCR), n=8. Values are averages obtained over three measurements. P-values were calculated via nested ANOVA, ****p<0.001
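
      The quantification described in panel B can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' analysis: the function name is hypothetical, the readings are invented, and mitochondrial OCR is taken to be the mean of the basal readings minus the mean of the residual readings recorded after rotenone and antimycin injection.

```python
from statistics import mean


def mitochondrial_ocr(basal_readings, residual_readings):
    """Mitochondrial OCR for one well.

    Assumption: residual (non-mitochondrial) OCR is whatever remains after
    rotenone and antimycin injection, and each phase is summarized by the
    mean of its readings.
    """
    if not basal_readings or not residual_readings:
        raise ValueError("both phases need at least one reading")
    return mean(basal_readings) - mean(residual_readings)


# Hypothetical readings (pmol O2/min) for a single well, three per phase.
basal = [100, 110, 90]
residual = [20, 25, 15]
print(mitochondrial_ocr(basal, residual))
```

      Subtracting the residual signal well by well keeps the comparison between cell lines focused on respiration that depends on the electron transport chain.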

      (3) There is a discrepancy between the basal values of the OCR from the same cell lines in different experiments, i.e., Figure 3A and Supp. Figure 3C, or in different experiments, Figure 3A, Figure 5E, and Figure 6A. The authors need to comment on/clarify that. Moreover, authors are encouraged to show ECAR values to support the conclusion that lactate production is not differentially affected by serine depletion, and thus, does not contribute to the increase in the NAD+/NADH ratio.

      We recognize the differences in basal OCR values across different experiments. Given experiment-to-experiment variation and the need for different cartridges for each Seahorse experiment, we have found that measured OCR values using Seahorse assays vary across experiments despite the same conditions. Additionally, while we aim to seed the same number of cells per assay, cell seeding and cell quantification after each Seahorse assay can contribute to variation. Given this variability on a per-assay basis, we performed a single experiment across all cancer cell lines examined to minimize variation in oxygen sensor calibration and address the reviewer question about whether absolute differences might contribute to response. These data are shown in Author response image 4.

      Regarding the reviewer’s request to present ECAR data, we note that measuring ECAR is dependent on using unbuffered media and for this reason we do not routinely measure ECAR. Our concern is that removing serum from the culture conditions can impact OCR measurements, and we instead prioritized maintaining the same media composition across all sets of experiments (i.e., cell proliferation assays, NAD+/NADH assays, kinetic tracing assays, and OCR measurements). Additionally, we point out that ECAR does not directly measure lactate. We refer the Reviewer to data included in the manuscript where GC-MS was used to directly measure lactate secretion over time for cells cultured with or without serine. These data are presented as Supplementary Figure 3B in the revised manuscript.

      (4) There seems to be also a discrepancy between the levels of M+2 citrate and the fraction labelled (Figure 5C versus Supplementary Figure 6C) in the H1299 cell line upon serine depletion, whereby the M+2 fraction seems unexpectedly lower in serine-deprived cells. In those conditions, H1299 cells showed an increased mitochondrial respiration, which is consistent with increased total citrate levels. This could be explained by a faster TCA cycle activity and the presence of higher-order isotopologues of citrate upon serine starvation. Is this the case? Showing the abundance of the different citrate isotopologues and their contribution to the total pool would help to interpret the results.

      We thank Reviewer 2 for this thoughtful comment regarding the discrepancy between M+2 citrate produced (normalized ion counts per cell) versus fraction of the total intracellular citrate pool that is M+2 labeled in serine depleted H1299 cells. In our kinetic U-<sup>13</sup>C-glucose tracing experiments, where we performed isotope labeling for up to 15 minutes, we only see a greater presence of M+3 citrate from fully labeled glucose without robust changes in M+4, M+5, or M+6 citrate (Author response image 5). An elevated M+3 citrate could represent pyruvate carboxylase activity, where M+3 labeled pyruvate is converted to M+3 oxaloacetate that then reacts with unlabeled acetyl-CoA to generate M+3 citrate.

      We also find that the total citrate pool in H1299 cells is elevated upon serine depletion (see Supplementary Figure 6C in the original manuscript). Thus, the fractional contribution of an isotopologue to the citrate pool may decrease despite an increase in the amount of that particular isotopologue. In the original manuscript, we included data from kinetic U-<sup>13</sup>C-glutamine tracing in H1299 cells cultured with or without serine (Supplementary Figure 6I,J of the original manuscript). We find that H1299 cells depleted of serine exhibit greater M+4 citrate (via oxidative decarboxylation) and greater M+5 citrate (via reductive carboxylation) compared to serine-replete H1299 cells. Thus, one other potential explanation for why M+2 citrate from kinetic U-<sup>13</sup>C-glucose tracing represents a lower fraction of the total citrate pool in serine depleted H1299 cells is that there is a larger contribution from glutamine to the citrate pool. While there was no difference in the fraction of the citrate pool that consists of M+4 citrate, there was a greater fraction of the citrate pool labeled by M+5 citrate upon kinetic U-<sup>13</sup>C-glutamine tracing in serine depleted H1299 cells (see Author response image 6A, B). There was also a greater fraction of the citrate pool from M+6 citrate upon kinetic U-<sup>13</sup>C-glutamine tracing in serine depleted H1299 cells (Author response image 6C). This would require M+3 pyruvate labeling from glutamine, which may be due to malic enzyme, which converts M+4 malate to M+3 pyruvate. M+3 pyruvate may also be formed by PEPCK, which could convert M+4 oxaloacetate to M+3 phosphoenolpyruvate, leading to M+3 pyruvate. While understanding the source of M+6 citrate from glutamine is out of the scope of this study, it may highlight an interesting metabolic shift in H1299 cells depleted of serine that could elevate the total intracellular citrate pool.
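
      The point that a labeled fraction can fall while the labeled amount rises is simple arithmetic, sketched below in Python. The numbers are invented purely for illustration and are not measurements from this study; the function name and the dictionary keys are likewise hypothetical.

```python
def fraction_labeled(isotopologue_amount, total_pool):
    """Fraction of the total metabolite pool carried by one isotopologue."""
    if total_pool <= 0:
        raise ValueError("total pool must be positive")
    return isotopologue_amount / total_pool


# Hypothetical normalized ion counts per cell (arbitrary units).
replete = {"m2_citrate": 20.0, "total_citrate": 100.0}
depleted = {"m2_citrate": 30.0, "total_citrate": 200.0}

frac_replete = fraction_labeled(replete["m2_citrate"], replete["total_citrate"])
frac_depleted = fraction_labeled(depleted["m2_citrate"], depleted["total_citrate"])

# The absolute M+2 amount rises (20 -> 30) while its share of the pool
# falls, because the total pool grows faster than the M+2 species.
print(f"M+2 amount: {replete['m2_citrate']} -> {depleted['m2_citrate']}")
print(f"M+2 fraction: {frac_replete:.2f} -> {frac_depleted:.2f}")
```

      In this toy case the pool doubles while the M+2 amount rises only 1.5-fold, so the M+2 fraction drops even though more M+2 citrate is present per cell.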

      Author response image 5.

      Citrate isotopologues (A. M+3; B. M+4; C. M+5; D. M+6) from kinetic U-<sup>13</sup>C-glucose tracing in H1299 cells depleted of serine for 24 hours. For all measurements, citrate values were normalized to internal norvaline standard and cell number for each condition, n=3. Data shown are means ± SD.

      Author response image 6.

      Fraction of the citrate pool labeled by U-<sup>13</sup>C-glutamine in H1299 cells depleted of serine for 24 hours. (A) Fraction of the total citrate pool that is M+4 citrate (formed via oxidative decarboxylation), n=3. (B) Fraction of the total citrate pool that is M+5 citrate (formed via reductive carboxylation), n=3. (C) Fraction of the total citrate pool that is M+6 citrate, n=3. Data shown are means ± SD.

      (5) The lipid depletion part of the paper seems to be somewhat tangential. The effect of lipid depletion on the NAD+/NADH ratio in A549 cells is modest, and the effects of dual serine and lipid depletion on OCR and NAD+/NADH ratio are not consistent. Moreover, if the authors want to show that these different nutritional environments affect lipid synthesis, apart from glucose incorporation into citrate, they would need to show actual carbon incorporation into palmitate, probably at longer time points.

      We apologize for the lack of clarity for how mitochondrial respiration and the NAD+/NADH ratio play a role in governing glucose oxidation to citrate. To better highlight our logic and rationale for investigating alterations in NAD+/NADH homeostasis and citrate synthesis under lipid depletion, we have added the following text to the revised manuscript:

      “Oxidative biosynthetic reactions other than serine synthesis can also be constrained by the NAD+/NADH ratio. For example, cancer cells deprived of environmental lipids increase oxidative citrate production, and we have previously found that citrate synthesis, either through glucose oxidation or glutamine oxidation, is limited by NAD+ availability (Li, 2022) (Figure 5A, Supplementary Figure 8A). Thus, we sought to uncover whether the increase in the cell NAD+/NADH ratio by mitochondrial respiration in response to serine withdrawal specifically supports greater serine synthesis or also leads to greater oxidative citrate production.” (Lines 307-313)

      While we have previously shown that alterations to the NAD+/NADH ratio can modify both citrate production and palmitate synthesis under lipid depleted conditions (PMID: 35739397), we agree with Reviewer 2 that no conclusion can be made about lipid synthesis without direct measurements and have revised the manuscript accordingly.

      (6) In Figure 6C-6F, showing the results of the controls (+serine +lipids) will help to clarify the extent to which serine and citrate synthesis rates are affected by the different interventions.

      We thank the reviewer for the comment. Because we specifically asked how dual serine and lipid starvation impacted either serine or citrate synthesis compared to singular nutrient deprivation alone, we performed the experiments focusing on these conditions. We felt that conducting an experiment that specifically targeted our question would make the findings more accessible as we had compared the +serine +lipid conditions to either serine or lipid depletion alone earlier in our manuscript (Figure 2D and Figure 5G,H of the revised manuscript).

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chang and colleagues provides new insights into how cancer cells adapt their metabolism under nutrient-deprived conditions. They find cells respond differentially to serine and lipid deprivation via oxidising the cell redox state, which enables biomass synthesis and cell proliferation. They identified mitochondrial respiration as the major mechanism that dictates the endogenous NAD+/NADH ratio. By incorporating a dual stress paradigm, serine and lipid deprivation, the study further suggests that the NAD+/NADH ratio can serve as a link to orchestrate the complex interplay between multiple nutrient changes in the tumour microenvironment.

      Strengths:

      A novel aspect of this study is the idea that cancer cells are not uniformly passive victims of nutrient limitation; some can actively invoke endogenous NAD+ regeneration to combat nutrient stress. The conclusion is well-supported by comparing multiple cell lines from different tissues and genetic backgrounds, which improves generalizability. While most of the smaller conclusions align with common reasoning and expectations, the step-by-step deduction that leads to a novel 'big picture' is commendable. Another notable strength is the integration of dual stress (lipid and serine deprivation), which better mimics the complex tumor microenvironment with multiple nutrient fluctuations, raising the translational potential of these findings. The observation that lipid-deprived cells can stimulate serine synthesis and support proliferation in a subset of cancer cell lines offers a novel perspective on metabolic plasticity under starvation conditions.

      We thank Reviewer 3 for their time and for their comments to help us improve the manuscript. We also thank them for highlighting the strengths and significance of our findings.

      Weaknesses:

      (1) Although the authors derive a novel and valuable overarching concept, the presentation of this "big picture" is not clearly articulated, making it less accessible to readers outside the immediate field. It would greatly enhance the manuscript to include a clearer summary of the overarching model and its implications. Additionally, discussing the potential clinical significance and applications of the findings would increase the relevance and broader impact of the work. Finally, the manuscript's clarity and credibility are undermined by inconsistent figure labeling and the lack of statistical analysis, particularly for the Western blot data.

      (A) It would greatly enhance the manuscript to include a clearer summary of the overarching model and its implications. Additionally, discussing the potential clinical significance and applications of the findings would increase the relevance and broader impact of the work.

      We appreciate Reviewer 3’s suggestion to help clarify the findings of this study. To better articulate our overarching model, we have added the following text to the end of the Results section of the revised manuscript:

      “Taken together, we propose a model where environmental nutrient availability can impact mitochondrial respiration based on the specific cancer. Because mitochondrial respiration is a major pathway that regenerates NAD<sup>+</sup>, changes to mitochondrial respiration can alter the cell NAD+/NADH ratio, influencing the activity of major NAD<sup>+</sup>-requiring metabolic reactions such as serine synthesis and citrate synthesis that can be important for proliferation. We further propose that changes to the cell NAD+/NADH ratio can impact all oxidative biosynthetic reactions if the enzyme machinery is present, but that specificity for how the cell NAD+/NADH ratio changes is dependent on both cell-intrinsic factors and cell-extrinsic factors (Figure 7).” (Lines 396-404)

      Additionally, a new model figure was added as Figure 7 in the revised manuscript, which may help with understanding for a general audience.

      To better highlight the potential clinical significance of these findings, we have added the following at the end of the Discussion section of the revised manuscript:

      “Better understanding the mechanisms cells use to alter respiration and adjust the NAD+/NADH ratio in response to available nutrients could inform the complex interplay between cell-intrinsic and cell-extrinsic factors that determine cancer metabolic dependencies. This is particularly important to consider when targeting metabolism for cancer treatment. Many newer therapies targeting metabolism have not been successful in part because of metabolic plasticity to nutrient shifts (Amoedo, 2017; Fendt, 2020; Xiao, 2023). Co-targeting mitochondrial function limits metabolic adaptations and may also help predict the tissue nutrient conditions that result in pathway dependencies for specific cancers. Thus, better understanding how the cell NAD+/NADH ratio responds to nutrient levels in different cancers could improve selection of patients for cancer therapies that impact metabolism.” (Lines 483-492)

      (B) “…the manuscript's clarity and credibility are undermined by inconsistent figure labeling and the lack of statistical analysis, particularly for the Western blot data.”

      We apologize to the reviewer for any inconsistency in data presentation. To address the comment related to inconsistent figure labeling, we ensured all figures in the revised manuscript are labeled to allow readers to recognize what cell lines are used, what conditions are tested, what parameters are measured, and how the data may or may not be normalized. To address the reviewer’s comments about lack of statistical analysis, in the revised manuscript we ensured that statistical analyses are included for data presented in each figure, when appropriate. We also include a section titled “Statistics and Reproducibility” in the Methods section. In our revised manuscript, we have ensured that the p-value threshold is consistent throughout all figures, and have removed “ns” across the manuscript for consistency as suggested by Reviewer 3 in their minor comments. We also removed any explicit p-values included in figures where the p-values were close to reaching the threshold for significance (α=0.05). We have also performed additional statistical analyses where needed, including adding the p-values for linear regression analyses, and ensured new data added to the revised manuscript also included appropriate statistical analyses.

      For western blot data, we show representative immunoblots. However, we measured PHGDH, PSAT1, and PSPH protein expression in three biological replicates across examined cancer cells and quantified the average serine synthesis protein expression from each replicate performed with error bars that denote standard deviation (see Author response image 7). We performed a nested ANOVA to examine whether there was a statistically significant difference in PHGDH, PSAT1, and PSPH protein expression between non-responder and responder cancer cells. Interestingly, as noted in our response to Reviewer 2, we find a significant negative association between PHGDH protein expression and response to serine deprivation among the six cancer cells where PHGDH protein expression did not explain proliferation upon serine depletion.

      Author response image 7.

      Serine synthesis enzyme protein expression in serine-replete and serine depleted cancer cells. (A) Immunoblots examining the expression of PHGDH, PSAT1, and PSPH in cancer cells as shown. HSP90 was used as a loading control. Data are from two separate biological replicates. (B) Mean levels of PHGDH, PSAT1, and PSPH normalized to loading control HSP90 across cancer cells from three separate biological replicates. Yellow denotes cancer cells that do not elevate mitochondrial respiration in response to serine depletion (non-responders). Blue denotes cancer cells that do elevate mitochondrial respiration in response to serine depletion (responders). P-values were calculated with nested ANOVA comparing non-responders and responders, **p<0.01

      (2) While this study identifies changes in serine synthesis, mitochondrial respiration, PHGDH protein levels, and NAD+/NADH ratio in different cell lines, some of these relationships appear correlative rather than causally established (Figure 2; Figure 5; Figure 6). Some claims are thus overinterpreted. For example, the co-occurrence of increased NAD+/NADH ratio and citrate levels under lipid deprivation in A549 cells does not establish causality (Figure 5). Direct perturbation experiments that manipulate NAD+/NADH and assess downstream effects on citrate synthesis would substantially strengthen the conclusions.

      We agree with Reviewer 3 that corresponding changes in proliferation, mitochondrial respiration, and serine synthesis are correlated to the NAD+/NADH ratio. As shown in Figure 4, we perturbed the NAD+/NADH ratio with FCCP and rotenone to measure downstream effects on serine synthesis. We also agree with the reviewer that doing similar experiments in the lipid depletion condition would highlight the relationship between the NAD+/NADH ratio and citrate synthesis. However, we point out that these experiments were already published in a manuscript from our group specifically showing that the NAD+/NADH ratio is limiting for citrate synthesis (PMID: 35739397). In that manuscript, the NAD+/NADH ratio was perturbed using electron transport chain inhibitors, including complex I inhibitors, which decreases the cell NAD+/NADH ratio. Exogenous electron acceptors were used to rescue the NAD+/NADH ratio, and under those conditions, cell proliferation, the NAD+/NADH ratio, and glucose and glutamine oxidation to citrate were measured with and without lipid depletion. We showed that decreasing the NAD+/NADH ratio decreases citrate synthesis through both glucose and glutamine oxidation and also affects palmitate synthesis. We could rescue citrate and palmitate synthesis by supplementing cells with exogenous electron acceptors. We also show that expressing cytosolic or mitochondrial NADH oxidase (LbNOX; PMID: 27124460) in mitochondrial complex I-inhibited cells rescues proliferation in lipid depleted conditions and that LbNOX expression raises oxidative citrate production at baseline. Given the extensive prior work showing the relationship between the NAD+/NADH ratio, oxidative citrate synthesis, and palmitate synthesis, efforts to repeat these same experiments for this manuscript were not warranted. 
      We do show in the current manuscript that treating cells with AKB or FCCP, which raises the NAD+/NADH ratio, also increases glucose oxidation to citrate (Figure 5D of the original and revised manuscripts). We did this to confirm that the elevated M+2 citrate production from glucose in serine starved H1299 cells was related to an increase in the NAD+/NADH ratio as opposed to a specific response to serine depletion.

      The study focuses predominantly on mitochondrial respiration as a source of NAD+ regeneration. However, it will also be interesting to check other significant pathways, such as NAD+ salvage, which have been implicated in supporting serine biosynthesis. In addition, the subcellular distribution of NAD+ may distinguish whether some cells are truly redox-unresponsive. Mitochondrial NAD+ regeneration might counteract the cytosolic NAD+ consumption, rendering a relatively stable intracellular NAD+/NADH ratio. The malate-aspartate shuttle can be an interesting aspect.

      (A) The role of NAD+ salvage and serine biosynthesis

      Per the reviewer’s request, we investigated whether NAD+ salvage might be involved in supporting serine synthesis. The reviewer’s comments raise an interesting question: whether NAD+ salvage contributes differentially to serine synthesis in cancer cells that elevate mitochondrial respiration in response to serine depletion versus cancer cells that do not. Specifically, we wondered whether cancer cells that do not elevate mitochondrial respiration in response to serine depletion depend more on NAD+ salvage to support proliferation in serine depleted conditions. To test this, we treated A549 and H1299 cells in serine depleted conditions with increasing doses of the nicotinamide phosphoribosyltransferase (NAMPT) inhibitor FK866. However, we found no statistically significant difference in sensitivity to FK866 upon serine depletion in these cells based on ANCOVA analysis (p=0.9332). Interestingly, we observe that A549 cells are more sensitive to FK866 treatment than H1299 cells in serine-replete media conditions (ANCOVA analysis, p=0.0004). This suggests that A549 cells at baseline may have greater dependence on NAD+ salvage compared to H1299 cells, though this is not specific to the response to serine depletion. We then asked whether nicotinamide mononucleotide (NMN), the product of NAMPT and the immediate precursor to NAD+ in the salvage pathway, would rescue the proliferation of A549 cells cultured without serine. We find that adding 100 µM NMN, a concentration that can impact PHGDH-driven serine synthesis (PMID: 30157431), does not change proliferation of A549 cells cultured without serine, unlike supplementing cells with AKB or FCCP, which increase NADH oxidation to NAD+. Together, these data suggest that NAD+ salvage does not play a major role in differentiating the redox response to serine deprivation between responder and non-responder cells.
      We have added these data as Supplementary Figure 3C,D of the revised manuscript.

      (B) The role of the malate-aspartate shuttle and serine biosynthesis

      The malate-aspartate shuttle (MAS) has been shown to play an important role in serine synthesis (PMID: 37647199) and may facilitate elevation in mitochondrial respiration in response to serine depletion. As stated in response to Reviewer 2, measuring subcellular compartment-specific NAD+/NADH ratios accurately is not feasible, so we utilized a functional approach to interrogate the role of compartmentalization. Specifically, we tested a role for the MAS. Using CRISPR/Cas9, we generated GOT1, MDH1, and GOT2 deleted H1299 cells. We did not knock out MDH2 given its integral role in the TCA cycle. Using the knockout lines, we measured the whole cell NAD+/NADH ratio and found that MDH1 and GOT2 KO cells no longer exhibited an elevated cell NAD+/NADH ratio upon serine depletion compared to non-targeting controls (NTC). Consistently, MDH1 and GOT2 KO cells did not elevate OCR upon serine deprivation, nor did they exhibit greater serine synthesis rates compared to NTC cells. This suggests that MDH1 and GOT2 activity support the process by which mitochondrial NAD+ regeneration provides cytosolic NAD+ to support serine synthesis. We next asked whether MAS protein expression differed between cells that elevate respiration in response to serine depletion and cells that do not. While enzyme expression is not equivalent to activity, we wondered whether MAS protein expression would be lower in cells that do not increase their mitochondrial respiration upon serine depletion. However, we observed no major difference in GOT1, GOT2, MDH1, or MDH2 protein expression across the cancer cells examined (Author response image 8). Further experimentation is needed to measure MAS activity across lines and may reveal a mechanism by which mitochondrial respiration is governed by nutrient availability, such as levels of environmental serine.

      Author response image 8.

      Protein expression of the malate aspartate shuttle enzymes GOT1, MDH1, GOT2, and MDH2 in cancer cells cultured without serine for 24 hours. Membranes were first probed for GOT1 or GOT2 then stripped and re-probed for MDH1 or MDH2.

      (3) The authors should acknowledge the limitations of short-term isotope tracing in their experimental design. Differences in metabolic rates across cell lines can affect the kinetics of metabolite labeling, limiting the direct comparability of metabolic fluxes between them. As a result, observed changes may reflect transient adaptations rather than stable metabolic reprogramming. It is important to clarify that the study primarily captures short-term responses, and the conclusions may not extrapolate to longer-term adaptations or protein-level changes under sustained nutrient stress.

      We thank the reviewer for this comment. We apologize for any confusion around experimental approaches. We agree that in the case of acute changes in nutrient availability at the start of kinetic isotope tracing, the observed changes may reflect transient adaptations. However, cells are exposed to conditions for 24 hours prior to performing kinetic tracing. This approach allows us to examine changes that occurred in response to the nutrient condition, not acute changes. Additionally, we add fresh, prewarmed treatment media at least two hours prior to commencing kinetic isotope tracing. Upon analysis of kinetic isotope tracing, we examine whether cells were at metabolic steady state by monitoring metabolite levels over the course of tracing. For example, in the kinetic glucose tracing experiments in serine depleted cells, total serine levels are relatively stable throughout the experiment, and we find that total serine levels are greater in H1299 cells after 24 hours of serine starvation. Data showing total metabolite pools over the course of tracing are shown in the Supplementary Figures (for example, see Supplementary Figure 8C-H in the revised manuscript). The period of treatment prior to the start of kinetic isotope tracing is described in the figure legends and further detailed in the “Kinetic U-<sup>13</sup>C-Glucose Isotope Tracing Experiments” section of the Methods in the revised manuscript. To improve clarity, we added a kinetic graph showing total serine levels over time in Supplementary Figure 2I of the revised manuscript as this can address whether synthesis rates are captured while cells are at metabolic steady state. We also discuss these considerations better in the revised manuscript with the following text:

      “Importantly, we confirmed kinetic U-<sup>13</sup>C-glucose tracing was performed at metabolic steady state by ensuring metabolite levels were stable at each collected time point (Supplementary Figure 2I)” (Lines 178-180).
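
      The steady-state check described above (monitoring whether total metabolite pools stay stable across the tracing time points) can be illustrated with a minimal sketch. The function name `pool_is_stable`, the coefficient-of-variation criterion, and the 15% tolerance are assumptions made for illustration; the response does not state which stability criterion was applied, and the numbers are toy values, not data from the manuscript.

```python
from statistics import mean, pstdev


def pool_is_stable(levels, max_cv=0.15):
    """Return True if the pool's coefficient of variation is within max_cv.

    levels: total metabolite abundance at each tracing time point.
    max_cv: hypothetical tolerance; choose it to suit the assay's noise.
    """
    avg = mean(levels)
    if avg <= 0:
        return False
    return pstdev(levels) / avg <= max_cv


# Illustrative numbers only (arbitrary units), not data from the manuscript.
stable_pool = [10.0, 10.5, 9.8, 10.2]
drifting_pool = [10.0, 14.0, 19.0, 25.0]

print(pool_is_stable(stable_pool))
print(pool_is_stable(drifting_pool))
```

      A pool that drifts steadily over the time course fails this check, which is the situation in which labeling kinetics would mix synthesis rate with a changing pool size.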

      Reviewer #3 (Recommendations for the authors):

      It is important to note that, in many cases, the data show only trends rather than statistically significant differences, or, if significance testing was performed, the results are not clearly labeled. For example, in Figure 1B, no p-value was denoted in the figure, and the scale bar is quite high, precluding the conclusion that "AKB and rotenone dose-dependently increased and decreased the cell NAD+/NADH ratio". In Figure 2E, no p-value was shown to support the result that "H1299 cells had higher serine level than A549 cells". Inconsistencies in how significance is denoted across figures (e.g., asterisks vs. numerical values; "ns" vs. no label) make interpretation difficult. Marginal significance (e.g., p = 0.06 in Figure B) can be reported explicitly, but all figures should clearly denote whether comparisons are significant or not. Conclusions drawn from non-significant trends should be appropriately stated.

      We thank Reviewer 3 for this important comment and for highlighting specific instances where the manuscript could be improved. Please see response to Reviewer 3, Major Comment 1B. We also agree with Reviewer 3 that it is integral to ensure that conclusions made from non-significant trends are appropriately stated. For example, we explicitly mention that there was no statistically significant difference between the serine synthesis rate of A549 cells depleted of serine versus A549 cells depleted of both serine and lipids (Line 375). As another example, we changed the phrase “Moreover lipid depletion led to a greater fraction of total serine derived from glucose in serine depleted A549 cells” to “Moreover, lipid depletion appeared to lead to a greater fraction…” (Line 376).

      Western blot data supporting PHGDH expression variability across cell lines (e.g., Supplementary Figure 2B, 3E) appear to rely on single experiments. At least three biological replicates are required to substantiate claims about discordance between PHGDH levels and serine sensitivity. Supplementary Figure 4G presents overexpression validation based on a single Western blot without quantification. Including statistical validation from biological replicates would strengthen this point.

      We thank Reviewer 3 for this suggestion. Western blots were repeated 3 times, although data from a representative blot is shown. Please see response to Reviewer 3, Major Comment 1B.

      Certain data visualizations (e.g., Figure 2C) lack annotation indicating which data points correspond to which cell lines, limiting interpretability. All figures should include clear labels, consistent statistical notation, and complete legends. The author uses different color labels (redox-responsive (blue) and unresponsive (yellow) cell lines), which provides mechanistic clarity; however, this classification was not consistently used across the manuscript (e.g., Figures 2d and 2e). To further improve reader comprehension, consider adding conceptual schematic diagrams before each main result section to illustrate experimental logic, and a final diagram summarizing the proposed mechanism.

      We apologize for any unclear data presentation. In the revised manuscript we have added greater clarity around what cell lines are used in each experiment and have added explicit labeling to specify cancer cell lines in Figure 2C of the revised manuscript. Throughout, we have ensured that any serine redox non-responder cell lines are labeled in yellow while serine redox responder cell lines are labeled in blue. We have also ensured that any lipid redox responder cells are labeled in green while lipid redox non-responder cells are labeled in dark purple, a change from the original manuscript. Finally, we have also added a schematic to summarize the proposed model in Figure 7 of the revised manuscript.

      Although the authors provide justification for using H1299 and A549 as representative cell lines to study serine depletion, it remains unclear whether these two lines are equally suitable for investigating lipid depletion. Additional rationale or supporting data would help clarify their appropriateness for the lipid-related experiments.

      We thank Reviewer 3 for this suggestion. We opted to study H1299 and A549 cells under lipid deprivation to assess their responses in relation to the response to serine deprivation. We specifically wanted to know whether these findings related to serine deprivation applied to other nutrient depleted conditions. We clarify this logic in the revised manuscript by adding the following text:

      “Oxidative biosynthetic reactions other than serine synthesis can also be constrained by the NAD+/NADH ratio. For example, cancer cells deprived of environmental lipids increase oxidative citrate production, and we have previously found that citrate synthesis, either through glucose oxidation or glutamine oxidation, is limited by NAD+ availability (Li, 2022) (Figure 5A, Supplementary Figure 8A). Thus, we sought to uncover whether the increase in the cell NAD+/NADH ratio by mitochondrial respiration in response to serine withdrawal specifically supports greater serine synthesis or also leads to greater oxidative citrate production.” (Lines 307-313)

      We have also included more detailed justification for focusing our studies on A549 and H1299 to study serine depletion by adding the following statements to the manuscript:

      “We performed focused comparisons between A549 and H1299 cells because they exhibit differences in proliferation upon serine deprivation that are not explained by PHGDH protein expression, demonstrate differing responses of the cell NAD+/NADH ratio upon serine deprivation, and have similar basal proliferation rates.” (Lines 171-175)

      The concentration of serine in replete media should be explicitly stated and justified. If the intention is to mimic physiological conditions, alignment with human plasma levels would increase translational relevance.

      We agree that explicitly stating the concentration of serine in replete media is important. In the revised manuscript, we explicitly state that DMEM contains 400 µM of serine and that we use this concentration for serine-replete conditions (Line 102). While an important application of our manuscript is to better explain metabolic changes that can occur in physiologic conditions, we acknowledge that we did not test levels found in different tissues. Rather, by examining extreme conditions of high and low serine, we hoped to dissect how cells adapt to nutrient conditions, and testing the more subtle responses based on tissue serine levels will require a dedicated study.

      Rotenone may elevate ROS levels and trigger cellular stress responses, potentially confounding proliferation assays. The authors should validate that concentrations used do not induce cytotoxicity or excessive oxidative stress, and ideally measure ROS levels to support interpretation.

      We thank Reviewer 3 for raising this important point. We explicitly measured cell viability with the doses of rotenone used in this manuscript in cells cultured with or without serine. We find that rotenone dose-dependently increases cytotoxicity in A549 cells grown in serine-replete conditions in a statistically significant manner as calculated by simple linear regression. However, the cytotoxicity from rotenone is low (at most 4% in serine depleted conditions) and does not explain differences in rotenone sensitivity with respect to serine synthesis. These data have been added to Supplementary Figure 1C of the revised manuscript.

      The evidence that lipid depletion can enhance serine synthesis in A549 cells is inadequate, given the marginal difference in NAD+/NADH ratio and the slight increase in M+3 serine levels. The statement "any perturbation that increases the NAD+/NADH ratio led to both elevated serine and citrate production, regardless of what nutrient was depleted from the environment" (introduction section) should be reworded.

      We thank Reviewer 3 for this suggestion. We have changed the above statement to the following:

      “Lastly, we find that any perturbation that increases the NAD+/NADH ratio, including lipid deprivation, could paradoxically improve the proliferation of cells in serine depleted conditions.” (Lines 90-92).

    1. eLife Assessment

      This study addresses an important gap in drug discovery by delivering a rigorous, large-scale evaluation of widely used co-folding methods for predicting ligand-bound protein complexes and virtual screening. A key strength is the comprehensive benchmarking framework, which leverages structures and chemical compounds that were absent from the AI models' training sets, thereby providing particularly compelling and unbiased evidence of co-folding performance. The findings clearly delineate the complementary roles of deep learning-based co-folding and physics-based docking, offering practical guidance for their rational integration into drug discovery workflows. Overall, the conclusions are well supported by thorough analyses across a representative set of cases and are highly convincing.

    2. Reviewer #1 (Public review):

      The authors conducted a comprehensive benchmarking and evaluation of co-folding platforms, including AlphaFold3, Boltz-2, Chai-1, and the docking algorithm Dock3.7, which employs a physics-based scoring function that incorporates van der Waals interactions, electrostatics, and ligand desolvation energies. The system of interest was the SARS-CoV-2 NSP3 macrodomain (Mac1), an increasingly popular antiviral target, and the ligand sets comprised 557 unseen ligand poses (keeping the training for these co-folding platforms in mind). Additionally, the authors investigated whether the co-folding models could distinguish true ligands from non-binding small molecules. The study is thorough, with extensive statistical support and consensus across multiple metrics (chemoinformatics for quantifying ligand similarity and efficacy). The questions that the authors aim to address are whether the co-folding models struggle with memorization, whether they can distinguish between a true and a false binder, whether they replicate experimental binding affinities and efficacy, and how they compare to the physics-based docking algorithm (Dock3.7).

      Strengths:

      Overall, this is a scientifically solid paper.

      The work is highly detailed and well executed, featuring thorough data analysis and statistical assessment.

      Comments on revised version:

      The authors have adequately addressed my concerns.

    3. Reviewer #3 (Public review):

      Summary:

      Core conclusions are well-supported by data: co-folding outperforms docking in known ligand pose/affinity prediction (validated by RMSD and IC₅₀ correlation), struggles with false positive discrimination in virtual screens (lower AUC values), and is complementary to docking (non-correlated errors, distinct strengths in drug discovery stages).

      Strengths:

      The unprecedented prospective design, with 557 novel Mac1-ligand complexes, ensures a rigorous, independent evaluation of co-folding methods and provides an unbiased benchmark dataset containing structures and compounds absent from the co-folding models' training sets. Comprehensive comparison of 3 co-folding tools (AlphaFold3, Chai-1, Boltz-2) with DOCK3.7 across diverse targets and metrics enables nuanced performance assessment. The revised results clarify an intriguing finding: co-folding can predict correct ligand poses even when protein conformations are mispredicted. The study clearly demonstrates complementary roles of co-folding (superior pose/affinity prediction for known ligands) and docking (better hit prioritization), and addresses deep learning memorization concerns via ligand similarity analysis.

      Weaknesses:

      The study identifies a major limitation of co-folding, the failure to capture rare protein conformational changes, which deserves future investigation. The authors include uncalibrated Boltz-2 affinity data (addressing a prior comment) but note that large-scale free energy perturbation (FEP) comparisons are beyond their capabilities.

      Appraisal of Aims Achieved:

      The authors successfully achieved their primary aims and the results provide strong, well-supported evidence for their core conclusions. Key conclusions are grounded in the study's unbiased, training-set-independent data, which ensures that the conclusions are not confounded by model memorization and are broadly applicable to the field's use of these co-folding models.

      Field Impact:

      This study provides a critical reality check for the field: co-folding models are powerful tools for pose prediction but are not yet standalone solutions for virtual screening, a key distinction that will prevent over-reliance on these models and guide more rational tool selection.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors conducted a comprehensive benchmarking and evaluation of co-folding platforms, including AlphaFold3, Boltz-2, Chai-1, and the docking algorithm Dock3.7, which employs a physics-based scoring function that incorporates van der Waals interactions, electrostatics, and ligand desolvation energies. The system of interest was the SARS-CoV-2 NSP3 macrodomain (Mac1), an increasingly popular antiviral target, and the ligand sets comprised 557 unseen ligand poses (keeping the training for these co-folding platforms in mind). Additionally, the authors investigated whether the co-folding models could distinguish true ligands from non-binding small molecules. The study is thorough, with extensive statistical support and consensus across multiple metrics (chemoinformatics for quantifying ligand similarity and efficacy). The questions that the authors aim to address are whether the co-folding models struggle with memorization, whether they can distinguish between a true and a false binder, whether they replicate experimental binding affinities and efficacy, and how they compare to the physics-based docking algorithm (Dock3.7).

      We thank Reviewer 1 for this thoughtful summary of our work.

      Strengths:

      Overall, this is a scientifically solid paper. The work is highly detailed and well executed, featuring thorough data analysis and statistical assessment.

      Weaknesses:

      My main concern is that the study's aim is a bit unclear. Modern benchmarking studies comparing physics-based docking with deep learning-based co-folding approaches (e.g., AF3, Boltz-2, Chai-1, and others) are increasingly expected to go beyond aggregate performance metrics.

      Indeed, we have gone into several examples of failures and successes for each of these methods. As we are not developing these methods ourselves, we also think this dataset will be a valuable contribution for improving them further.

      In addition to rigorous dataset construction, transparent methodology, and appropriate statistical evaluation, high-impact benchmarks typically provide actionable guidance on when each method class is most appropriate, reflecting their distinct inductive biases and practical constraints. Failure-mode analyses that link performance differences to protein flexibility, ligand chemistry, or binding-site characteristics are particularly valuable, as they move comparisons beyond "scoreboard" assessments toward mechanistic understanding.

      Right now, we do not observe meaningful trends that separate the failure modes for any individual method. This is covered in Supplementary Figures 6 and 7.

      While full biological validation is not expected, qualitative interpretation grounded in physical and biological principles strengthens conclusions. Providing reproducible workflows or reference pipelines is not mandatory, but it is increasingly viewed as a best practice because it facilitates adoption and helps contextualize results for practitioners.

      We note that our code is available (https://github.com/jongbin99/Cofolding/) and all structural data will be publicly accessible in the PDB alongside publication (we held it back only for “blinding” during peer review to avoid contamination with any new deep learning methods).

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Kim et al. evaluates the performance of three modern AI-based methods in predicting complex structures and binding affinities between proteins and chemical compounds. An honest 'prospective' evaluation is achieved by studying benchmark structures and chemical compounds that did not exist in the PDB at the time the AI structure prediction models (AlphaFold3, Chai-1, Boltz-2) were trained.

      Strengths:

      (1) The study addresses an important question in modern computational biology and drug discovery, and establishes the strengths and limitations of the three tools in solving various computational chemistry tasks, including compound pose prediction, active-inactive discrimination, and potency ranking.

      (2) The conclusions are based on examination of four separate targets and respective compound datasets, where for one of the targets, the authors also obtained numerous X-ray structures to serve as experimental answers for the binding pose prediction task.

      (3) The study reports relationships between structure prediction confidence, predicted energies (DOCK3.7), and affinity predictions (Boltz-2) with the geometric accuracy of compound pose prediction as well as the experimentally measured potency.

      (4) One of the key findings is the limited ability of co-folding methods to predict conformational rearrangements, which does not correlate with their ability to predict binding poses of the compounds inducing these rearrangements.

      (5) The findings could serve as useful guidelines for computational chemists in selecting appropriate software and scoring schemes for each task.

      We appreciate Reviewer 2’s summary of the novelty of the dataset and analysis.

      Weaknesses:

      While I consider this a solid study, several aspects would need to be addressed to make it really strong:

      (1) DOCK3.7 docking and scoring experiments were performed using one experimental structure of Mac1, selected from dozens of structures based on a criterion that is not sufficiently well justified. For sigma2 receptor, dopamine D4 receptor, and AmpC β-lactamase, it is not clear which structures or models were selected for docking at all. It is well known that geometry predictions, scoring, and active-inactive ROC AUCs are all strongly influenced by the selected structure. It would be important to attempt Mac1 docking using all available experimental Mac1 structures, or at least against representative structures in various conformations; it would also be quite insightful to compare results to docking of the same compound sets to AF3, Boltz-2 and Chai-1 predicted structures of Mac1. Same goes for the docking studies of sigma2, D4, and AmpC β-lactamase.

      In any program, a decision has to be made as to which template will be used for docking. We justified our choice in the Methods:

      “We used this structure because the inhibitor (Z5014193706) was the most potent molecule with a structure determined around the same time as the ligands in this dataset were tested.”

      We stand by this as a reasonable assumption. Similarly, for sigma2, D4, and AmpC β-lactamase, the template was chosen in the respective papers:

      a) The σ2 receptor bound to cholesterol (PDB ID: 7MFI) was used in the docking calculations.

      - This structure was determined in the paper, the first structure of sigma2 and therefore a worthy template

      b) The D4 receptor campaign used PDB 5WIU

      - This was one of two D4 structures available and chosen because it was not bound to sodium

      c) For AmpC, the campaign used the structure in the Protein Data Bank (PDB) 1L2S

      - This maximizes comparisons to other docking studies that used the same receptor template.

      The major goal of this study is to compare different methods under reasonable (but perhaps as the reviewer points out, not optimal) conditions, not to optimize docking score.

      (2) For binding affinity predictions, as a control, authors should consider compound co-folding with an unrelated protein, or even with a pseudo-peptide that consists of a few random single amino acids - this would provide an honest baseline for such predictions.

      This suggestion would be valuable for understanding the performance for these methods from the perspective of ligand specificity (a valuable, but separate, goal). Surely this will generate some number or some prediction - but what would this baseline mean and how would it be relevant for drug discovery? Therefore, we do not think this suggestion is relevant for the issues being investigated in this manuscript.

      (3) ROC curves in Figure 3 and elsewhere should be shown, and AUCs quantified and reported, on a log- or square-root-scaled x-axis to emphasize early enrichment, which is the area of practical significance for these predictions. For example, Figure 3A currently suggests that the pose prediction performance of AF3 exceeds that of Boltz-2, whereas the early enrichment is clearly better for Boltz-2.

      We agree with this, and added a semi-logAUC plot for Figure 3A. For Figure 5, we also generated a semi-logAUC plot to see early ligand enrichment clearly, added as Supplementary Figure 11. We added the text:

      “Considering its early enrichment performance, Boltz-2 Ligand ipTM was the strongest predictor of pose accuracy based on normalized logAUC (20.5% above random, Fig. 3a). In contrast, although Boltz-2 pIC50 showed poor overall discrimination, it overestimated its ability to enrich true positive poses at low false positive rates, despite having a weak early enrichment behavior”
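
      To make the semi-log AUC idea above concrete, the following is a minimal sketch of one common convention: integrate the true positive rate against log10 of the false positive rate over a fixed window, normalise by the window width, and subtract the value a random ranking would obtain. The function names and the lower bound of 0.001 on the false positive rate are assumptions for illustration; the response does not state which bounds or implementation were used for the reported values.

```python
import math


def log_auc(fpr, tpr, min_fpr=1e-3):
    """Area under TPR vs log10(FPR) on [min_fpr, 1], normalised to [0, 1].

    fpr, tpr: ROC points; the curve is assumed to start at (0, 0) and end at (1, 1).
    TPR is interpolated linearly in FPR between consecutive points.
    """
    pts = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x1 <= min_fpr or x1 == x0:
            continue  # segment lies below the window or is vertical
        if x0 < min_fpr:
            # Clip the segment at the lower bound of the window.
            y0 = y0 + (y1 - y0) * (min_fpr - x0) / (x1 - x0)
            x0 = min_fpr
        slope = (y1 - y0) / (x1 - x0)
        intercept = y0 - slope * x0
        # Exact integral of (slope * x + intercept) with respect to log10(x).
        area += (slope * (x1 - x0) + intercept * math.log(x1 / x0)) / math.log(10)
    return area / math.log10(1.0 / min_fpr)


def adjusted_log_auc(fpr, tpr, min_fpr=1e-3):
    """logAUC minus the value expected for a random ranking (TPR == FPR)."""
    random_value = log_auc([0.0, 1.0], [0.0, 1.0], min_fpr)
    return log_auc(fpr, tpr, min_fpr) - random_value
```

      Under these assumptions a random ranking scores about 14.5% before adjustment and 0 after it, so a figure quoted as a percentage "above random" corresponds to the adjusted quantity. Because the x-axis is logarithmic, each decade of false positive rate carries equal weight, which is what makes the metric sensitive to early enrichment.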

      (4) 'Trained set' in figures and text should probably be 'training set'? Or otherwise explain this new term the first time it is introduced.

      Thank you for pointing this out. ‘Training set’ is the correct term, and we have made the change across all figures and text.

      (5) Figure 1 illustrates a projection onto the first two principal components of a space that apparently had only one (scalar) metric for each compound pair (% maximum common substructure or Tanimoto coefficient); the authors need to better explain the principle behind this analysis and visualization.

      This suggestion is valuable, since PCA is more often used to reduce the dimensionality of more complex features. For clarification, we have a full pairwise similarity matrix for all tested Mac1 compounds for each of Tc and MCS%. The PCA for each metric is a projection of the corresponding pairwise similarity matrix, in which each compound is described by its vector of similarities to all other compounds. We also changed the Figure 1 caption to make this point clearer:

      “projection of compounds represented by their full pairwise similarity vectors (by ECFP-4 Tc and MCS%)”
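
      The principle can be sketched as follows, assuming each compound's fingerprint is a set of integer feature IDs (a stand-in for ECFP-4 bits) and using a simple power iteration in place of a full PCA routine. All names and the toy fingerprints are hypothetical; this is not the authors' code.

```python
import math

def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of feature IDs."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def similarity_matrix(fingerprints):
    """Row i is compound i's similarity vector against every compound."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]

def first_pc_scores(rows, iterations=200):
    """Project rows onto their first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] + [0.0] * (d - 1)  # arbitrary start vector (an assumption)
    for _ in range(iterations):
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(d)) for x in centered]

# Two toy chemotypes with two compounds each.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12, 13}, {10, 11, 12, 14}]
sim = similarity_matrix(fingerprints)
scores = first_pc_scores(sim)
```

      Because every compound contributes a whole row of similarities, the input to PCA is a vector per compound rather than a single scalar per pair, and compounds sharing a chemotype land on the same side of the first component.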

      Reviewer #3 (Public review):

      Summary:

      This study's core conclusions are well-supported by data. It is shown that co-folding outperforms docking in known ligand pose/affinity prediction (validated by RMSD and IC₅₀ correlation), struggles with false-positive discrimination in virtual screens (lower AUC values), and is complementary to docking (non-correlated errors, distinct strengths in drug discovery stages).

      Strengths:

      (1) Unprecedented prospective design with 557 novel Mac1-ligand complexes ensures rigorous, independent evaluation of co-folding methods.

      (2) Comprehensive comparison of 3 co-folding tools (AlphaFold3, Chai-1, Boltz-2) with DOCK3.7 across diverse targets and metrics enables nuanced performance assessment.

      (3) The study clearly demonstrates complementary roles of co-folding (superior pose/affinity prediction for known ligands) and docking (better hit prioritization), and addresses deep learning memorization concerns via ligand similarity analysis.

      We thank Reviewer 3 for pointing out the unprecedented and comprehensive nature of our study.

      Weaknesses:

      (1) Limited generalization to diverse protein families (e.g., no ion channels/transporters).

      We agree - we have not explored the entire proteome and these are important target classes that will surely be investigated by future studies. We focused on targets here where we had a large number of X-ray crystal structures (Mac1) and affinity/inhibition measurements from docking (the other three targets).

      (2) Ambiguity in the mechanism underlying co-folding's failure to predict rare conformational changes.

      Again, we agree. We are not the developers of these methods. We observe that these methods do not predict conformational changes with high fidelity and this weakness is an area that co-folding methods will surely prioritize in the future.

      (3) Virtual screen comparison is unbalanced (docking-prioritized hit lists bias results).

      We acknowledge this in the results: “An important caveat is that the hit-lists were composed of molecules prioritized by docking in the first place, giving it an advantage on these particular sets.” and discussion: “Finally, comparing co-folding to docking based on hit-lists themselves selected by docking is arguably unfair to co-folding. Counter-balancing this is the inclusion, in each of the three hit lists, of molecules that had mediocre and poor docking scores intentionally selected to test the correlation between docking score and hit-rate. Here too, the correlation between co-folding score and likelihood to bind, what we sometimes call a “dock-response-curve” was no better than docking’s, often worse (SFig.11).”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Here are suggestions for revisions:

      (1) The writing is at times obtuse and hard to follow.

      This happens sometimes when multiple authors are writing together. We apologize and are happy to respond to specific areas that can be streamlined to be easier to follow.

      (2) In the Results section, "A set of 557 previously unreported Mac1 ligand complexes", the authors have compared the ligand poses across different metrics such as Tc - a standard, highly effective method in chemo-informatics and MCS (maximum common substructures); these are standard metrics for quantifying the structural similarity between pairs of small molecules. This part of the analysis checks whether this is memorization; it is critical to compare the two metrics, but it is not sufficient to draw a conclusion.

      Thank you for raising the question of how structurally similar the co-folded molecules are to those present in the training set (resolved as Mac1 complexes and deposited in the PDB before the training cutoff dates). We have conducted an analysis in which we perform a pairwise similarity comparison for all ligands present in the PDB (regardless of the target), by both Tc and MCS, and overlay the clusters of ligands we tested (Mac1, AmpC, sigma2, D4). This shows where our tested benchmark datasets lie in the chemical space covered by the entire PDB. Each cluster (around 500 to 1300 compounds per target system) is overlaid on the cluster of all ligands deposited in the PDB (over 50,000 compounds), and each cluster was relatively diverse by both Tc and MCS.

      (3) In the "Co folding can accurately reproduce poses of ligands dissimilar to those trained." Subsection under Results, the authors' conclusions are hard to follow; they state that the co-folding models often mispredict or miss the alternative conformation, but they also predict poses that are distinct from the training set. What does that imply?

      Our interpretation is actually a somewhat unsettling one: co-folding gets the ligand pose right even when it gets the protein wrong, and even when the ligand is novel. This suggests the models may be anchoring on conserved pharmacophoric interactions (like the adenosine-mimicking purine scaffold) rather than truly modeling the physics of the full complex. We added to the results section:

      This result suggests that co-folding reliably recapitulates dominant ligand-binding interactions even in the absence of accurate protein conformational modeling, providing further support to the idea that they are learning specific interaction patterns rather than a deeper physics-based representation (Masters et al. 2025).

      (4) The Discussion section connects the results and conclusions, but it can be challenging to grasp the study's overall message.

      We think the final paragraph hits on three major points:

      - Co-folding accurately predicts ligand poses for known binders, but fails to capture conformational changes

      - Co-folding does not reliably distinguish true binders from false positives in virtual screening hit lists

      - Docking and co-folding are complementary rather than competing tools

      (5) The work is highly detailed and well executed, featuring thorough data analysis and statistical assessment. The value of the paper would be further enhanced by explaining how it differs from seemingly similar results reported in other studies, including the one cited in this manuscript (see https://www.biorxiv.org/content/10.64898/2025.12.04.692352v1).

      The Mac1 results are completely unique. However, the docking datasets are exactly the same as those analyzed in the Menon et al. manuscript. We do not think our results differ from the conclusions of the Menon et al. manuscript; as we wrote: “These observations are supported by a fascinating study on some of the same ligand sets as investigated here, using AlphaFold3, reaching similar conclusions (Menon et al. 2025).”

      Reviewer #3 (Recommendations for the authors):

      (1) Expand target diversity to include ion channels, transporters, etc., beyond enzymes and GPCRs.

      (2) Investigate the cause of co-folding's failure in predicting rare conformational changes (e.g., adjust sampling, MSA inputs, or add experimental constraints).

      (3) Mitigate docking bias in virtual screens (e.g., re-analyze unbiased compound libraries).

      We addressed these three points in the public review above.

      (4) Test Boltz-2's affinity predictions without linear calibration and compare with FEP.

      The data without linear calibration are included in the manuscript. Comparing such a large number of compounds with FEP is currently beyond our capabilities.

      (5) Conduct proof-of-concept to test co-folding-docking integration for better hit rates.

      We think this is well beyond the scope of this manuscript - but look forward to testing this idea in the future.

      We also got one community review that we respond to below:

      Summary

      This manuscript evaluates the performance of co-folding models when tasked with 1) the recapitulation of a large number of experimentally determined co-crystal structures of Mac1 with a series of Mac1 ligands and 2) the rescoring of hits to identify false positives originally derived from a set of large docking-based virtual screens. The evaluation leverages a dataset of crystal structures and affinity data from high-throughput crystallographic and biophysical screens, respectively. These data uniquely enable this report to focus on the ability of co-folding models to handle ligands, resulting in an analysis that is particularly timely given the wide adoption of co-folding models and the relative scarcity of such ligand-focused benchmarks among existing evaluations, which have primarily focused on protein structure prediction or binder design.

      Thank you for this thoughtful summary of our work

      Feedback

      The experiments and analyses in the manuscript are well thought-out and do not have any significant issues. There are a few high-level points that may improve the clarity and completeness of the results. Importantly, none of the suggested additional experiments will affect the conclusions of the paper, but rather help provide additional context for the results:

      The first section presents an exciting opportunity to frame the Mac1 ligands against ligands in the PDB more broadly. It would be informative to assess whether chemotypes that are easier or harder to predict accurately and confidently are over- or under-represented in the PDB as a whole. Note that this is not a recommendation that new scaffold similarity metrics be incorporated into the analysis, but rather that analyses similar to those already performed in the manuscript are performed using all ligands in the PDB. For example, PCA-based analyses similar to those in Fig. 1c could be used to examine Mac1 ligands in the context of all PDB ligands enabling questions such as whether similarity to a nearest PDB neighbor, cluster size in a Tc/MCS PCA space, or other frequency-based measures show any relationship with prediction vs. crystal structure RMSD. Such analyses could provide additional insight into how effectively models leverage ligand information present in the PDB overall, as opposed to biases arising specifically from scaffolds represented in Mac1 structures in the PDB, which are already well covered in the manuscript. The conclusion that Tc/MCS do not correlate with the ligand RMSDs for the ligands already associated with the Mac1 is well supported, and presumably suggests that a correlation would not exist against the backdrop of the PDB, but it would be interesting to see the data using analyses similar to those already done in the manuscript nonetheless.

      We are adding new figures in SFig.1 that consider how the different clusters of ligands tested in our co-folding analysis are distributed across the chemical space of the PDB. This is done by making a similarity comparison between every ligand in the PDB and those tested in our analysis by Tc and MCS%, then plotting in PCA space for each metric. We are excited to see that each dataset covers a wide scope in PCA space, but at the same time, there are areas of PDB chemical space that remain unexplored by co-folding.

      Similarly, even though the four proteins used in this manuscript are not themselves the primary focus of the analysis, it would be valuable to perform a high-level assessment of the precedent for each protein in the PDB (beyond the count of liganded structures in Table S6), either in protein sequence space (e.g., MSAs) or structural space (e.g., FoldSeek). An analysis like this would provide important context about whether any of the proteins in the study have close homologs with liganded structures in the PDB, or are generally overrepresented in the PDB. The fact that the AUC for L-pLDDT for AmpC is higher than σ2 and D4, for example, is notable given the relative abundance of liganded AmpC structures in the PDB (this raises potentially interesting questions related to where DOCK3.7 and AF3 actually place the ligands, given the orthosteric β-lactam binding pocket in AmpC, although this is outside of the scope of this manuscript).

      A high-level assessment of the precedent for each protein in the PDB would certainly help clarify whether the proteins we used have close homologs with liganded structures in the PDB. Our Supplementary Table 6 covers the extent to which these liganded structures were available by the cutoff dates for AF3, Chai-1 and Boltz-2. AmpC had more homologs than sigma2 and D4, and this may explain the better AUC for AF3 L-pLDDT specifically for this target.

      A discussion of the affinity probability results (`affinity_probability_binary`) from Boltz-2 is likely warranted in the second section in addition to the pIC50s that are already reported (`affinity_pred_value`). The former seems like it would be more applicable for section 2 of the manuscript, but both warrant inclusion—they should both be calculated by default when the affinity pipeline in Boltz-2 is turned on, so it wouldn't involve any more inference.

      Because the Boltz-2 affinity module outputs both a binary affinity probability and a predicted affinity value, we tracked both metrics and re-ranked the hit lists using each. Where Boltz-2 performed better (sigma2, D4), the binary probability values were the better metric for differentiating true actives from non-binders, which was clearer in the semi-logarithmic ROC plots. For AmpC, however, both Boltz-2 scoring metrics performed similarly. This inconsistency made it difficult to draw conclusions.

      Minor points

      A more detailed description of the experimental methods used to generate the ground-truth data in the introduction (even though these have been explained in prior works) would help orient the reader early on, and ground the benchmarking aspect of the story. In general, the abstract and introduction would benefit from a more cohesive through-line to tie the two complementary but orthogonal sections of the paper together.

      We will include a more thorough description alongside the PDB depositions. As for the two sections, we have tried to tie them together from the perspective of drug discovery workflows…

      The cutoffs in the "Co-folding can accurately reproduce..." section shift between 2.5 Å (from the ligand center of mass) and 2.0 Å. Is there a reason for this? Along similar lines, mentioning cutoffs for true positives/negatives when introducing the ROC analyses later on in the Mac1 section seems unnecessary since no cutoff should be necessary here.

      We used a 2.5 Å distance to the ligand center of mass (COM) as a fast filter for “broadly the correct binding site”, and 2.0 Å RMSD because that is the broadly accepted standard in the field for a “relatively correct binding pose”.
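
      The two-stage use of these cutoffs can be sketched as below. This is a minimal sketch assuming predicted and reference ligand atoms are listed in the same order in a shared coordinate frame, that the unweighted centroid stands in for the center of mass, and that no symmetry correction is applied to the RMSD. The function names are hypothetical.

```python
import math

def centroid(coords):
    """Unweighted geometric center (an assumption; a true COM would weight by mass)."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))

def com_distance(pred, ref):
    return math.dist(centroid(pred), centroid(ref))

def rmsd(pred, ref):
    """Heavy-atom RMSD without superposition; atoms must be in matching order."""
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref))
    return math.sqrt(squared / len(ref))

def classify_pose(pred, ref, com_cutoff=2.5, rmsd_cutoff=2.0):
    """Coarse binding-site filter first, then the stricter pose criterion (both in Å)."""
    if com_distance(pred, ref) > com_cutoff:
        return "wrong site"
    if rmsd(pred, ref) > rmsd_cutoff:
        return "correct site, wrong pose"
    return "correct pose"

def shift_y(coords, dy):
    return [(x, y + dy, z) for x, y, z in coords]

# Toy three-atom ligand, rigidly displaced by different amounts.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
results = [classify_pose(shift_y(reference, dy), reference) for dy in (1.0, 2.2, 5.0)]
```

      The centroid check is cheap and tolerant of internal rearrangements, whereas the RMSD check penalizes any per-atom deviation, which is why the two thresholds need not agree.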

    1. eLife Assessment

      The nematode C. elegans is an ideal model in which to achieve the ambitious goal of a genome-wide atlas of protein expression and localization. In this paper, the authors develop a rational and useful strategy for at-scale tagging of all protein coding genes with fluorescent markers, providing solid evidence that it could be a feasible foundation for a large-scale, community-wide project.

    2. Reviewer #1 (Public review):

      Summary:

      Eroglu and Hobert demonstrate that injecting CRISPR guides and repair constructs to target three genes at a time, tagging each with a different fluorescent protein, and selecting which gene to tag with which fluorophore based on genes' expression levels, can improve efficiency of gene tagging.

      Strengths:

      This manuscript demonstrates that three genes can be targeted efficiently with three different fluorophores. It also presents some practical considerations, like using the fluorophore least complicated by agar/worm autofluorescence for genes with low expression levels, and cost calculations if the same methods were used on all genes.

      Weaknesses:

      Eroglu has demonstrated in a previous publication that single-stranded DNA injection can increase efficiency of CRISPR in C. elegans, while inserting two fluorescent proteins and a co-CRISPR marker into three loci, and Paix et al 2015 demonstrated simultaneous insertion of two fluorescent tags. The current work is a valuable and incremental advance. In general, I applaud the authors' willingness to strategize about how whole proteome tagging might be accomplished. I predict that the advance here will be one of many small advances that will get the field to that goal. The title oversells the advance presented, in my view, since it seems like one among many key advances, and the first sentence of the Discussion seems a more apt summary of the key advance here.

      Some injections targeted genes on the same chromosome together, which will create unnecessary issues for the crosses that will be useful in some future experiments. This made me wonder if injecting 3 together really is helpful vs targeting each gene separately, since only 5 worms need to be injected. It cuts time down by 2/3, but perhaps avoiding targeting the same chromosome with two tags would be useful.

      The limited utility of current blue fluorescent proteins makes me wonder if they are worth using at this stage, before there are better blue fluorescent proteins, or better yet, far red, to avoid issues with live imaging under phototoxic UV or near-UV illumination.

    3. Reviewer #2 (Public review):

      Original Review:

      The manuscript by Eroglu and Hobert presents a set of strains each harboring up to three fluorescently tagged endogenous proteins. While there is technically nothing wrong with the method and the images are beautiful, we struggled to appreciate the advance of this work - who is this paper for?

      As a technical method, the advance is minimal since the first author had already demonstrated that three mutations (fluorophore insertion and co-CRISPR marker) could be introduced simultaneously.

      As a pilot for creating genome-scale resources, it is not clear whether three different fluorophores in one animal, while elegantly designed and implemented, will be desired by the broader community.

      Finally, the interpretation of the patterns observed in the created lines leaves much to be desired. A Table with all the observations must be included and can replace the tedious (and often wrong) descriptions of the observations with the different lines. It would be too much to point out every mistaken expectation of protein expression. Two examples include:

      The expectation that ACDH-10 is enriched in the intestine and epidermal tissues (hypodermis) is naïve - there are multiple paralogs of this protein (look at WormPaths or WormFlux) that may share functions in different tissues. There is also no reason to assume that fatty acid metabolism does not occur in other tissues (including the germline). Finally, there are no published studies about this enzyme, so we really don't know for sure what it's doing.

      The expectation that HXK-1 is ubiquitously expressed is similarly naïve. There are three paralogous enzymes that are all associated with the same reaction, and we have shown that these three function redundantly in vivo, perhaps in different tissues (PMID: 40011787). Moreover, single cell RNA-seq data (PMID: 38816550) also shows enrichment of hxk-1 in gonadal sheath cells.

      The table should have at least the following information: gene/protein name - Wormbase ID - TPM levels of single cell data assigned to tissues for L2, L4 and adult (all published) - tissues in which expression is observed in the lines presented by the authors.

      Other points:

      (1) We would encourage the authors to provide systematic validation of the reported insertions. The manuscript reports that 24 of 30 tags were isolated and visible but does not clearly state whether each isolated line was confirmed by sequence‑level validation to be correctly in‑frame and free of unintended mutations at the target locus.

      (2) The manuscript presents aggregated success counts (e.g., 8/10 mTagBFP2 tags, 9/10 mStayGold, 7/10 mScarlet3) and useful narrative descriptions of injection outcomes. We also suggest including per‑locus success rates.

      (3) For pools that required re‑injection after initial failures, we would like to see a description of the specific changes that were made to the injection mixes or procedures (e.g., new repair template prep, different Cas9 reagent lot, guide redesign). This will be useful troubleshooting information for others.

      (4) The authors state that the fluorophore sequences are codon-optimized for C. elegans. We suggest they provide the exact donor/tag sequences used and specifically state whether the fluorophore sequences contain any synthetic/artificial introns and whether other sequence modifications (e.g., silent PAM‑disrupting mutations) were included in the donor templates.

      (5) Page 3: Include a reference for "The C. elegans genome encodes around 20,000 genes"

      We hope these comments are useful.

      Comments on Revised Version:

      Overall, we found the responses to be quite recalcitrant.

      We have one remaining composite concern about the comparison between observed expression patterns with the new strains versus published data.

      First, the authors only report patterns for one stage, although it should not be too much effort to image the different life stages. However, since this is a revision, we are not formally requesting they do this.

      Second, in the now provided Table (thank you) 'observed expression' (last column) is lacking for 9 of the 30 proteins, and for 6 of these the procedure was not successful. Why not report patterns for the other three? It is confusing also because on page 5, the authors say that "overall, 24 of 30 tags ...all of which were visible with fluorescence stereomicroscopy" - are we missing something? Also, they then said that they "obtained 6/9 of the originally failed tags"; why are the corresponding patterns not included in table 1, and are 9 proteins still labeled as "no" in the "success?" Column?

      Third, we strongly feel that the response to our comments about expression patterns is not adequate. On page 5 the authors say that "all proteins were expected to be ubiquitously expressed" and that "scRNA-seq indicated that transcript abundance was ubiquitous and without strong tissue-specific enrichment with few exceptions". However, in their rebuttal, the authors now argue for tissue-specific expression for proteins with paralogs, turning around their own argument! Moreover, their Table indicates that many genes show tissue-enriched expression by RNA-seq while many of their tagged proteins exhibit ubiquitous expression.

      Overall, this indicates that both the overall accomplishment of generating tagged protein strains and analyzing their expression is oversold.

    4. Reviewer #3 (Public review):

      Summary:

      The authors argue that establishing the expression pattern and sub-cellular localisation of an animal's proteome will highlight hypotheses for further study. This claim is probably accepted by many in the community. This manuscript seeks to confirm the feasibility of establishing such a resource, by using current transgenic methods to knock in DNA encoding different colored fluorescent tags into C. elegans genes.

      Strengths:

      The authors make the points above. For example, they provide evidence that the C. elegans germline harbors two populations of mitochondria that differ qualitatively in the proteins they express. They also confirm that labelling the whole proteome is an achievable goal with relatively limited resources and time.

      Weaknesses:

      The work is somewhat incremental in that it uses existing transgenic technology. Cell biology in C. elegans is challenging because of the small size of many of its cells, notably neurons. This can make establishing the sub-cellular localisation of a fluorescently tagged protein, or co-localizing it with another protein, tricky. The authors point out in their introduction that advances in light microscopy such as diSPIM, STED and ISM (a close relative of SIM), have increased the resolution of light microscopy. They also point out that recent advances in expansion microscopy can similarly help overcome the resolution limit. However, they do not use these technologies to characterize their transgenic strains.

    5. Reviewer #4 (Public review):

      Summary:

      Tagging the entire proteome of a metazoan would be a landmark achievement, providing a powerful complement and extension to existing "omic" catalogs in model systems. Here, Eroglu and Hobert argue that efficiently tagging multiple loci in a single "batch" would make the community-based achievement of this goal realistic. They provide rigorous evidence that such an approach is indeed feasible, exploring issues related to efficiency, design and screening strategies, disruption of gene function, and the potential for endogenously tagged alleles to reveal unexpected aspects of protein expression and localization. While the work has some minor gaps that are important to rigorously assess the feasibility of the proposed effort, the detailed and valuable insights that emerge should provide impetus to the community to coordinate efforts to make this ambitious goal a reality.

      Strengths:

      The work has numerous strengths. The authors provide compelling evidence that:

      - three distinct loci can be efficiently targeted with three distinct fluorescent tags in a single injection.

      - thoughtful targeting design can reduce the likelihood of disruption of function by the tag.

      - systematic design principles based on expression level and predicted localization/function can be used to optimize tagging strategies.

      - the resulting tags can provide unexpected insight into patterns of protein production and subcellular localization.

      Not all of these advances are novel in themselves, but taken together, they represent an important technical and conceptual advance. The most important strength comes from the exceptionally high value of the goal itself, in that the work has the potential to spur a community-wide effort toward achieving the ambitious goal of proteome-wide tagging.

      Weaknesses:

      The work's shortcomings are minor.

      - One concern has to do with the feasibility of the proposed screening strategies. The experimental design cleverly coinjects tags for three loci in different gene expression 'zones'; this expression level determines which tag will be used. As the authors allude to, among genes with the same overall FPKM value there is an important distinction between those that are expressed broadly and those that are focally expressed in a specific tissue. The proposed strategy claims that there are a sufficient number of highly expressed genes "to be used as visible markers" for recovering successfully edited animals. It would be useful for the authors to discuss the issue of broad vs focused expression among this set of genes a bit more thoroughly, with an eye toward the issue of how likely it is that these genes could indeed consistently be used as visible markers, particularly for those at the low end of this limit.

      - What fraction of the proteome (on a per-gene basis) is secreted proteins? How difficult will it be to screen these for successful tags? Are there specific tags that would be more optimal for secreted proteins? (The authors mention the use of an SL2 or T2A cassette to label the cells in which these proteins are expressed but note that there are technical challenges associated with doing this at scale.)

      - For secreted and/or weakly expressed genes, it would be useful for the authors to estimate for what fraction of these would successful insertions need to be screened by PCR, and what resources (time and money) this would likely entail.

      - For how many genes would a single tag not capture all predicted isoforms?

      - Finally, some readers might object to the authors' assertion in the abstract that this work is "a first step in this direction" (presumably referring to designing a strategy for whole-proteome tagging). There is no concern that the authors are disregarding the extensive work of other groups, as they explicitly mention the contributions of other groups to the foundation that enables the present work. However, the spirit of the abstract could be misinterpreted by a well-intentioned reader.

    6. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      The nematode C. elegans is an ideal model in which to achieve the ambitious goal of a genome-wide atlas of protein expression and localization. In this paper, the authors explore the utility of a new and efficient method for labeling proteins with fluorescent tags, evaluating its potential to be the basis for a larger, genome-wide effort that is likely to be very useful for the community. While the evidence for the method itself is solid, carrying out this project at a large scale will require significant additional feasibility studies.

      We appreciate the editor’s recognition that the evidence for our method is solid and that a genome-wide protein atlas in C. elegans would be highly valuable to the community. However, we respectfully disagree that “significant additional feasibility studies” are required. Take the yeast proteome-wide GFP tagging project (Huh et al., Nature 2003). It achieved ~75% coverage of ~6,000 proteins directly from an established protocol without any prior significant feasibility studies, at least to our knowledge. While the C. elegans genome is 3 times the size, we would argue that our tagging protocol may even be less labor intensive as it does not involve any cloning and the screening is visual, requiring no molecular biology skills. Reviewer 3 notes: ‘They also provide convincing evidence that labelling the whole proteome is an achievable goal with relatively limited resources and time.’

      Our pilot study validates all key parameters for genome-wide scaling: editing efficiency at novel loci with untested reagents, viability of tagged worms, and detectability of multiple spectrally separated fluorophores across expression ranges. These address the core technical, biological, and practical challenges of large-scale endogenous tagging in a multicellular organism, leaving no fundamental barriers in our view.

      The proposed cost and timeline align quite favorably with established large-scale consortium projects: e.g., ENCODE pilot analyzed 1% of the human genome at ~$55 million over 4 years; Mouse Knockout Consortium scaled to ~20,000 genes over 20 years (ongoing) with ~$100 million; Human Protein Atlas mapped ~87% of proteins with antibodies in fixed cells (through much more labor intensive methods) over 20+ years at >$100 million. With ~8% of C. elegans genes already tagged (WormTagDB) and labs already tagging entire gene classes (PMID: 40463100), scaling our protocol to the proteome is feasible, potentially covering the genome in 5-6 years by a single lab or faster with distributed effort at a reagent cost of merely $2.2 million. The main barriers now are funding commitment and assembling collaborators, not further feasibility testing.
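
      The scale implied by these figures can be checked with simple arithmetic. The sketch below assumes roughly 20,000 protein-coding genes, pools of three genes per injection, the quoted $2.2 million reagent cost spread over all genes, and the midpoint of the 5-6 year estimate. How the quoted cost is actually apportioned (for example, whether it excludes the ~8% of genes already tagged) is not stated here, so the per-gene figure is only indicative.

```python
import math

# Figures quoted in the response and reviews; their combination here is an assumption.
GENES = 20_000            # approximate number of protein-coding genes
POOL_SIZE = 3             # genes tagged per injection pool
REAGENT_COST = 2_200_000  # quoted reagent cost in US dollars
YEARS = 5.5               # midpoint of the quoted 5-6 year timeline

pools = math.ceil(GENES / POOL_SIZE)
cost_per_gene = REAGENT_COST / GENES
pools_per_week = pools / (YEARS * 52)

print(f"{pools} pools, ${cost_per_gene:.0f} per gene, {pools_per_week:.1f} pools per week")
```

      Under these assumptions a single lab would need to sustain a little over twenty injection pools per week, which gives a concrete sense of the throughput the proposal implies.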

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Eroglu and Hobert demonstrate that injecting CRISPR guides and repair constructs to target three genes at a time, tagging each with a different fluorescent protein, and selecting which gene to tag with which fluorophore based on genes' expression levels, can improve the efficiency of gene tagging.

      Strengths:

      This manuscript demonstrates that three genes can be targeted efficiently with three different fluorophores. It also presents some practical considerations, like using the fluorophore least complicated by agar/worm autofluorescence for genes with low expression levels, and cost calculations if the same methods were used on all genes.

      Weaknesses:

      Eroglu has demonstrated in a previous publication that single-stranded DNA injection can increase the efficiency of CRISPR in C. elegans while inserting two fluorescent proteins and a co-CRISPR marker into three loci. The current work is, therefore, an incremental advance. In general, I applaud the authors' willingness to think ahead to how whole proteome tagging might be accomplished, but I predict that the advance here will be one of many small advances that will get the field to that goal.

      Our manuscript indeed builds on prior multiplex editing (including our own co-CRISPR work), but the manuscript's primary contribution is not a novel technical breakthrough per se. Instead, our main goal was to pilot and strategize a feasible path to whole-proteome tagging in C. elegans and, most critically, test the following key parameters: (1) the success rate of triple pools with previously untested reagents at novel targets; (2) the utility of fluorophores across expression levels; (3) major effects on tagged protein function. In prior multiplexing, we used two targets that we already knew could be edited quite efficiently, with the third target being a point mutation with nearly 100% efficiency. Thus, it was not at all clear that picking 3 random genes and replacing the highly efficient third locus with another, less efficient large insertion would work or be sufficiently scalable for thousands of novel genes with unvalidated reagents at first pass.

      The title vastly oversells the advance in my view, and the first sentence of the Discussion seems a more apt summary of the key advance here.

      Some injections target genes on the same chromosome together, which will create unnecessary issues when doing necessary backcrossing, especially if the mutation rate is increased by CRISPR.

      We disagree with the reviewer’s assessment of the need for backcrossing, for two reasons: (1) Prior studies have shown that off-target mutations are not a serious concern in C. elegans (reviewed in PMID: 26336798). For instance, WGS of strains after CRISPR/Cas9 found negligible off-target effects (PMID: 25249454, PMID: 30420468 – using similar RNP/ssDNA method and multiple guides; PMID: 23979577, PMID: 27650892 using other methods). Targeted sequencing studies have reported similar findings, using various CRISPR/Cas9 methods, with essentially no mutations at sites other than the intended target (PMID: 23995389; PMID: 23817069). (2) If the goal is to tag the entire genome, the introduction of backcrossing should not reasonably be a routine part of the initial tagging.

      Lastly, if one really does want to backcross, the existence of tags on the same chromosome is actually an advantage because it permits selection for recombinants with wild-type chromosomes.

      Also, the need for backcrossing and perhaps sequencing made me wonder if injecting 3 together really is helpful vs targeting each gene separately, since only 5 worms need to be injected.

      Apart from our disagreement regarding backcrossing, we are puzzled by the reviewer’s comment. Why would one do single tagging at a time, rather than triple tagging if the whole point is to scale up tagging? It is important to keep in mind that the rate limiting step for tagging the whole genome is the number of injections that can be done per day. Since there is no cloning to generate the repair templates/guides and all other reagents are commercially available and not sample specific, these can be prepared quite rapidly. Being able to isolate multiple lines (together or independently) from the same injection increases throughput 3-fold and in our view does not provide any disadvantages as individual tags can be isolated independently if desired.

      Beyond the numerous technical advantages pooling provides (including lower cost and higher throughput for making injection mixes as well as for imaging), our results show that it yields epistemic benefits as well: we would never have noted the subcellular pattern in Fig. 6B, C, with different sets of mitochondria being marked by different mitochondrial proteins, had we imaged them separately or even aligned them to a pan-mitochondrial landmark. As we mentioned in the discussion, grouping proteins predicted to localize to the same compartment together can simultaneously test how uniform or differentiated such compartments are during the screen.

      The limited utility of current blue fluorescent proteins makes me wonder if it's worth using at all at this stage, before there are better blue (or far red) fluorescent proteins.

      We do not think that the utility of current BFPs is that limiting. At least the theoretical brightness of mTagBFP2 is comparable to that of EGFP (PMID: 30886412), which was useful for the bulk of currently tagged proteins. Due to modestly higher autofluorescence in the blue spectrum, the practical brightness is somewhat less ideal, but we have shown that many proteins are expressed highly enough to be detected quite well with mTagBFP2 by eye at low magnification. We also note that many tags that are not visible by eye under a dissection scope become visible with the long-exposure cameras of widefield microscopes or with modern confocal (GaAsP) detectors, so the list of genes detectable with mTagBFP2 is likely to be much longer. We routinely use mTagBFP2 to super-resolve subnuclear structures with endogenous tags (e.g., in the nucleolus), with some tags having lower annotated FPKMs than the genes tested here.

      Some literature reviews, particularly in the Introduction and Abstract, rely too much on recent examples from the authors' laboratory instead of presenting the state of the field. I'd like to have known what exactly has been done with simultaneous injection targeting multiple loci more thoroughly, comparing what has been accomplished to date by various laboratories' advances to date.

      We are not sure what the reviewer is referring to. In the Abstract, we do not refer to any literature. In the Introduction, we cite 28 papers, 6 of them from our lab (4 of which provide examples of protein tags). We do not believe that this can be fairly called an unbalanced presentation of the state of the field.

      This being said, we have gladly expanded our Introduction to provide more background on co-CRISPRing. Labs have routinely used co-conversion (“coCRISPR”) markers for picking out their intended edits (e.g., point mutations or insertions), as it has been shown by multiple groups that a CRISPR/Cas9 edit at one locus correlates with efficiency at other simultaneous targets (PMID: 25161212). Generally, making point mutations with the Cas9/RNP protocol is highly efficient, especially at specific loci such as dpy-10. However, multiple FP-sized insertions have not been routinely attempted. We and only one other group have successfully attempted it using previously working targets and reagents (e.g., 28% in PMID: 26187122). Importantly, the efficiency of such multiple insertions has never been assessed at scale and using entirely untested reagents at novel sites – critical parameters to determine for a whole genome approach. So, we test here (1) the efficiency of triple insertions and (2) the chance of getting them with new and untested guides and reagents.

      In our view, since we have to use some injection/coCRISPR marker anyway for those genes which are not expressed at dissecting-scope visible levels (likely most genes), using highly expressed intended targets as improvised markers in a pooled approach makes our approach much more efficient. It allows us to find the worms with the highest chance of yielding CRISPR insertions, which we can screen with higher power methods for the dimmer targets, while enabling us to co-isolate other intended targets. Insertions, being often heterozygous in F1, can be segregated independently if desired, or homozygosed together to facilitate maintenance then outcrossed individually by those interested in studying specific genes in more detail.

      In the revised version of this manuscript, we now discuss some of these points in the introduction section:

      “Currently, around 1554 proteins representing 8% of the proteome are estimated to have been endogenously tagged (Leyhr et al., 2025). However, at current rates, tagging the proteome is projected to take around 100 years and likely involve numerous duplicate attempts on a small number of commonly studied proteins (Leyhr et al., 2025). It will thus be crucial for the field to coordinate tagging efforts and scale up tagging protocols to enable coverage of the entire genome at a reasonable timescale and cost. Given the number of injections is a major time-limiting factor, pooling multiple injections into one would at minimum cut tagging time by a factor of 3. In C. elegans, screening for novel CRISPR/Cas9-induced genomic edits is already facilitated either by use of co-injection markers (i.e., plasmids that form extrachromosomal arrays) that yield phenotypes or fluorescence in progeny of successfully injected worms, or co-editing well characterized loci using established and highly efficient reagents which likewise yield visible phenotypes. In the latter approach, termed “co-CRISPR”, worms edited at the marker locus are most likely to also carry the intended edit (Arribere et al., 2014). Recent methods for CRISPR/Cas9 mediated genomic insertions have pushed efficiencies to sufficient levels to simultaneously insert multiple fluorophores (e.g., mNeonGreen and mScarlet) as well as a co-CRISPR marker (dpy-10) at three independent loci in a single injection (Eroglu et al., 2023; Paix et al., 2015). These attempts pooled reagents previously established to work efficiently and targeted genes that were known to yield functional fusion proteins when tagged. 
      Thus, while in principle current methods could allow tagging of at least 3 independent loci in one injection if a co-CRISPR marker is omitted, it is not known to what extent such an approach could be generalized across the genome with previously unvalidated reagents (i.e., guides and repair template homology arms) at novel loci to yield functional tags”

      Reviewer #2 (Public review):

      The manuscript by Eroglu and Hobert presents a set of strains each harboring up to three fluorescently tagged endogenous proteins. While there is technically nothing wrong with the method and the images are beautiful, we struggled to appreciate the advance of this work - who is this paper for?

      We consider this paper to have two purposes: (1) to motivate the community to come together to consider such a genome-wide tagging approach; (2) to provide a reference point for funding agencies showing that such an aim is not unreasonable and will provide novel, interesting insights.

      As a technical method, the advance is minimal since the first author had already demonstrated that three mutations (fluorophore insertion and co-CRISPR marker) could be introduced simultaneously.

      We agree that the basic principle is similar. However, given the numbers in our original paper, it was not clear that pooling three novel large edits would work, or that it would be scalable.

      The dpy-10 coCRISPR marker previously used is a highly efficient single site, with close to 100% hit rate. We also knew in the earlier study that the two pooled insertions already worked quite efficiently and did not disrupt the function of targeted proteins. Exchanging these plus dpy-10 for three novel tags was not guaranteed to succeed for many potential reasons, both biological and technical. For instance, such a “marker free” approach requires that a significant number of targets in the genome be expressed highly enough to be visible by fluorescence stereomicroscopy when tagged with the best current fluorophores. Neither the chance of disrupting gene function by tagging nor whether one untested guide is generally sufficient had been explored in detail in C. elegans. We think that establishing these parameters was meaningful and necessary for the goal of whole genome tagging. We have clarified some of these points in the text.

      As a pilot for creating genome-scale resources, it is not clear whether three different fluorophores in one animal, while elegantly designed and implemented, will be desired by the broader community. 

      The usage of three different fluorophores is largely driven by the ability to co-inject and therefore cut injection effort by a factor of three. Moreover, having all three fluorophores together facilitates imaging and maintenance. Lastly, co-labeling has the potential to reveal unexpected patterns of co-localization or lack thereof (example: two mitochondrial proteins that we found to not have overlapping distribution). We clarified this point in the revised text in both the results and discussion.

      Finally, the interpretation of the patterns observed in the created lines is somewhat lacking. A Table with all the observations must be included. This can replace the descriptions of the observations with the different lines, which could be somewhat laborious for the reader, and are often wrong. There are numerous mistaken expectations of protein expression here, but two examples include:

      We are not convinced that our expectations are mistaken. Below we respond to the reviewer’s specific examples, and we are open to hear from the reviewer about additional cases.

      (1) The expectation that ACDH-10 is enriched in the intestine and epidermal tissues (hypodermis).

      There are multiple paralogs of this protein (see WormPaths or WormFlux) that may share functions in different tissues. There is also no reason to assume that fatty acid metabolism does not occur in other tissues (including the germline). Finally, there are no published studies about this enzyme, so we really don't know for sure what it's doing.

      The expression of acdh-10 is annotated in multiple scRNA datasets as intestine and epidermis enriched (CeNGEN/Taylor et al. 2021, highest in epidermis; Ghaddar et al. 2023, highest in intestine). We did not mean to imply that fatty acid metabolism does not occur in the gonad, nor that a paralog of acdh-10 could not be performing the same function in tissues where acdh-10 is not expressed.

      However, this raises an important question: why have different paralogs doing the same thing? Duplicate genes with the same function are generally not evolutionarily stable (PMID: 11073452, PMID: 24659815). That there are such striking tissue specific expression patterns of an essential or widely expressed protein class suggests that paralogs of the gene likely differ in some meaningful parameter that might align with tissue-specific functional needs or regulation. The reviewer’s statement that ‘there are no published studies about this enzyme, so we really don't know for sure what it's doing’ is in fact an excellent demonstration of our point; finding out where the duplicates are expressed can provide a starting point to uncover potential differences between the paralogs. At the very least it can delineate to what degree paralogs diverge in their expression across the proteome and identify which such cases merit further study. In a more ideal scenario, prior information of protein function could indicate that the involved pathway requires tissue specific regulation.

      (2) The expectation that HXK-1 is ubiquitously expressed.

      Three paralogous enzymes are all associated with the same reaction, and we have shown that these three function redundantly in vivo, perhaps in different tissues (PMID: 40011787).

      The cited paper (PMID: 40011787) does not show where they are expressed. We discussed redundancy/paralogs above in point 1, and in our view the same applies here. They may perform the same reaction but are likely to differ in some meaningful way, be it regulation or rate of activity, for them to be stably maintained as functional genes over evolution.

      Moreover, single-cell RNA-seq data (PMID: 38816550) also show enrichment of hxk-1 in gonadal sheath cells.

      The Ghaddar et al. and CeNGEN/Taylor et al. datasets do not show this. The scRNA paper cited (PMID: 38816550) also shows enrichment in neurons, pharynx, coelomocytes and germ cells, which we did not note. In our view, these in fact further support our goals: often, transcript datasets alone (frequently used to infer tissue function) do not sufficiently predict protein expression. One can post hoc find an scRNA-seq dataset that aligns somewhat with our protein observations, but how does one know which to trust a priori? Disagreements between transcript datasets will ultimately require resolution at the protein level, in our view.

      To clarify these points, we added the following to the discussion section:

      “We also noted unexpected cell type dependent distributions of proteins involved in broadly important metabolic processes such as ACDH-10, which was depleted from the germline compared to other tissues, and HXK-1, which was highly enriched in the gonadal sheath. Notably, for these as well as other cases, scRNA-seq datasets were not sufficient to deduce a priori the observed cell type specific differences at the protein level. Importantly, many genes encoding metabolic enzymes including acdh-10 and hxk-1 have paralogs that likely perform similar catalytic functions. Yet, duplicate genes with identical functions are generally not evolutionarily stable (Adler et al., 2014; Lynch and Conery, 2000); thus such genes are likely to differ in some meaningful parameter (e.g., regulation or activity) that might align with tissue-specific functional needs. Fully annotating the expression patterns of paralogs at the protein level could indicate which tissues require unique metabolic needs and indicate which paralogous genes have undergone sub- versus neo-functionalization. For those proteins that are less functionally understood, unexpected distributions might indicate which merit further study.”

      The table should have at least the following information: gene/protein name - Wormbase ID - TPM levels of single cell data assigned to tissues for L2, L4, and adult (all published) - tissues in which expression is observed in the lines presented by the authors.

      We added some of this information such as annotated expression levels in young adults from various scRNA datasets (but not larval datasets as we did not image these). We note that each of these studies use different pipelines and report different metrics (scaled TPM/Z-score versus Seurat average expression versus TPM), so comparisons between them are not informative unless they are integrated and analyzed together.

      Reviewer #3 (Public review):

      Summary:

      The authors argue that establishing the expression pattern and subcellular localisation of an animal's proteome will highlight many hypotheses for further study. To make this point and show feasibility, they developed a pipeline to knock in DNA encoding fluorescent tags into C. elegans genes.

      Strengths:

      The authors effectively make the points above. For example, they provide evidence of two populations of mitochondria in the C. elegans germline that differ qualitatively in the proteins they express. They also provide convincing evidence that labelling the whole proteome is an achievable goal with relatively limited resources and time.

      We appreciate the referee’s recognition that whole proteome tagging is feasible.

      Weaknesses:

      Cell biology in C. elegans is challenging because of the small size of many of its cells, notably neurons. This can make establishing the sub-cellular localisation of a fluorescently tagged protein, or co-localizing it with another protein, tricky. The authors point out in their introduction that advances in light microscopy, such as diSPIM, STED, and ISM (a close relative of SIM), have increased the resolution of light microscopy. They also point out that recent advances in expansion microscopy can similarly help overcome the resolution limit.

      (1) Have the authors investigated if the three fluorescent tags they use are appropriate for super-resolution microscopy of C. elegans, e.g., STED or SIM? Would Elektra be better than mTAGBFP2? How does mScarlet3-S2 compare to mScarlet 3?

      All three tags work for ISM (i.e., Airyscan). We previously tried Electra (not for the genes tested here) but could not isolate positive tags. Given Electra is not that much brighter on paper than mTagBFP2 we did not pursue it further, though we recognize that these may simply have been unlucky injections. mScarlet3-S2 is quite a bit dimmer than mScarlet3 on paper – the advantage is that it has higher photostability. In our view, the limiting factor will be having FPs that are bright enough to screen, image and scale to the whole genome, so brightness will likely provide an advantage over photostability at this stage.

      (2) Have the authors investigated what tags could be used in expansion microscopy - that is, which retain antigenicity or even fluorescence after the protocol is applied? It may be useful to add different epitope tags to the knock-in cassettes for this purpose.

      mStayGold (mSG) and mScarlet3 (mSc3) retain fluorescence after fixing with formaldehyde. We have not tested mTagBFP2 fluorescence in fixed worms. We agree that adding different epitope tags would be useful.

      The paper is fine as it stands. The experiments above could add value to it and future-proof it, but are not essential. If the experiments are not attempted, the authors could refer to the points above in the discussion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Merged figures appear saturated, and use colors that won't work for red-green colorblind viewers. 

      For all figures, we also show individual channels separately, which is common practice for making fluorescence images accessible to colorblind readers (PMID: 33788834). Figures highlighting non-overlap like 6B and C are already in accessible colors when merged (blue/green) and include a numerical quantification. 3-color RGB images preserve the greatest information for the highest number of individuals.

      (2) Targeting ubiquitously expressed genes as a proof of concept gives me some concern that this might underestimate the challenges that may be experienced with less widely expressed genes.

      While the genes were predicted to be ubiquitously expressed, many were not in practice, like HXK-1 and F54C8.1, which were also among the lower expressed genes on our list and highly cell type restricted. As discussed, the more tissue restricted a gene, the likelier that bulk RNA levels underestimate expression. Such genes are therefore more likely to be detected in a specific tissue. We routinely isolate tissue restricted endogenous tags, including those expressed in only a few neurons, with bulk FPKMs lower than the ranges tested in this manuscript.

      (3) Some results are not shown or referenced (autofluorescence, for example, is shown using a schematic in Figure 1C).

      We now provide representative images alongside what would be expected to be observed by eye during screening.

      (4) It would be useful to describe how to recover worms from what is shown in Figure 1A. 

      In the revised version, we added the following in the caption for Fig. 1A:

      “Selected worms expressing the brighter tag can be screened for dimmer tags by higher magnification and long exposure imaging. Worms can be recovered directly from slides if immobilized by levamisole as described (Ghanta et al., 2021). Alternatively, single hermaphrodite worms can be isolated, allowed to lay eggs, then screened.”

      (5) A blue bar of data must be missing from Figure 3B injection pool 5.

      As stated in the text, “All but one tag (cox-6B::mTagBFP2) was visible in the F1 generation of injected P0 animals, and these were subsequently isolated among F2 worms positive for the other tags in the pool.”

      To clarify that data points are not unintentionally omitted, we added the following text to the caption of Fig. 3B:

      “For group 5 including cox-6B::mTagBFP2, worms with detectable levels of mTagBFP2 fluorescence were not recovered in the F1 generation but were isolated among progeny of F1s positive for mStayGold and mScarlet3; we were thus unable to quantify efficiency for this locus at F1.”

      (6) Some expression or localization patterns were unexpected, but complications like germline silencing and protein mislocalization, with a small fraction localizing normally and rescuing function, were not presented as possibilities. Viability is used to confirm function, but without presenting whether this means 100% viability, less, or just the ability to maintain a strain.

      We already do discuss mislocalization and functionality issues in the Discussion, as well as tradeoffs of alternate methods. Any existing method to observe biological molecules, be it protein, RNA or DNA, has multiple drawbacks and sources of artifacts, which are unlikely to be fully eliminated in the foreseeable future.

      In regard to germline silencing of endogenously tagged genes in C. elegans, there is actually very little evidence for this. Collectively, various labs have now generated over 200 reporter alleles of germline-expressed genes (WormTagDB), with robust expression throughout the germline and retention of function. Likewise, many of our tags across fluorophores showed robust germline expression, including EEF-1A.1::mTagBFP2, Y22D7AL.10::mStayGold, and HAT-1::mScarlet3. In fact, overall transcript levels generally tended to underestimate germline enrichment at the protein level. We note that single-copy transgenes driven by the eef-1A.1/eft-3 promoter alone are frequently not expressed in the germline (PMID: 31064766); that we could detect EEF-1A.1 robustly in the germline when tagged endogenously is evidence that silencing is unlikely to be a widespread concern, and is at the least less of a concern than for single-copy transgenes. We appreciate that for a transgene, the presence or absence of specific sequence elements and the genomic locus play a role in expression, but an endogenous tag captures all such information at a given locus.

      Indeed, we found only two reports of endogenous tags being silenced in the germline, the first being a novel tag (not fluorophore) which initially prevented expression at the tagged locus (PMID: 30109984), but after making changes to the sequence to avoid silencing signals the authors could rescue expression and thereafter saw robust expression in various novel contexts with this tag. The second example (PMID: 34547227) leaves open the possibility that germline repression of that particular gene might be a part of its endogenous regulation.

      Nevertheless, given that it is probably rare, if it occurs at all, it will likely take a large-scale tagging effort to uncover such cases in sufficient numbers to study. In our view, this further justifies tagging at large, ideally genomic, scales. If we do discover that there are numerous annotated germline proteins which we don’t observe by tagging, that would be interesting to study on its own.

      (7) Halotag is presented in the Discussion as a small tag, but it is bigger than GFP.

      Thank you for catching this. We have removed the discussion of Halotag. Given the comparable size to FPs, it would be unlikely to alleviate issues of tag functionality.

      (8) It would be useful to include FPKMs and viability percentages in Table 1.

      FPKM is included in column 6, but the title for this column is cut off. In the revised table FPKM values are now shown more clearly across stages.

      We did not quantify viability percentage. In our view it does not yield an informative metric when there is little information about the protein’s required dosage for function, which was the case for most proteins here. A haplosufficient gene might yield a full brood size even if 50% of protein function is lost; conversely, a highly dose sensitive protein could yield penetrant and severe inviability with mild perturbation of function. It also is not actionable information at this stage if there is no alternate tagging strategy as a baseline of comparison. The worms we picked to image all have viable embryos as adults, so in those individuals the genes were likely to be sufficiently expressed and functional.

      (9) Because establishing that a guide works well is a limiting step for many CRISPR experiments (once a guide works well, it's easy to inject 5 worms and get lines), I wondered if testing that for many genes is what is really needed in the field at this stage. 

      Guide quality is rarely an issue in C. elegans: for all the genes here we tried only one guide, and all of these guides were previously untested. We have now clarified this in the discussion section:

      “Notably, we find that previously untested guide RNAs and homology arms perform exceptionally well at novel loci, as we only tested one set of reagents for each locus which yielded satisfactory tagging rates.”

      (10) For a manuscript where the injection is so central to what was done, I was surprised to read in the Acknowledgments that all of the injections were done by someone who is not included as an author.

      We are likewise surprised by such a comment but gladly clarify: Chi Chen has been with us as an expert microinjection specialist for more than 25 years, and her very important technical contributions have been acknowledged in many dozens of papers. Multiple authorship guidelines, including COPE’s and ICMJE’s, state that technical contributions alone do not qualify for authorship.

      Reviewer #2 (Recommendations for the authors):

      (1) We would encourage the authors to provide systematic validation of the reported insertions. The manuscript reports that 24 of 30 tags were isolated and visible, but does not clearly state whether each isolated line was confirmed by sequence‑level validation to be correctly in‑frame and free of unintended mutations at the target locus.

      We appreciate the reviewer’s concerns on fidelity. These parameters have been assessed in prior published work (e.g., PMID: 30504364, PMID: 34748534) and in our hands are in the range of 80% whenever we sequence non-fluorescent tags of similar sizes. The efficiencies we observed are high enough that one can expect to recover numerous worms with the exact intended sequence for each target, though we would argue mutations within the FP reporter are less likely to matter if it retains high fluorescence.

      (2) The manuscript presents aggregated success counts (e.g., 8/10 mTagBFP2 tags, 9/10 mStayGold, 7/10 mScarlet3) and useful narrative descriptions of injection outcomes. We also suggest including per‑locus success rates.

      Figure 3B shows the per-locus success rate, and source data are provided for this figure. Each dot is an individual injection and the Y axis is the per-locus rate. We have now worded this more clearly in the figure’s caption:

      “Total insertion efficiencies per locus for the indicated targets across injection pools.”
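      For orientation, the aggregated counts quoted by the reviewer can be combined as follows. This is a minimal sketch that uses only the per-fluorophore totals stated in the comment above (8/10, 9/10, 7/10); it does not reproduce the per-locus, per-injection efficiencies plotted in Figure 3B, and the variable names are illustrative.

```python
# Recovered tags per fluorophore as (isolated, attempted), from the reviewer's summary.
recovered = {
    "mTagBFP2": (8, 10),
    "mStayGold": (9, 10),
    "mScarlet3": (7, 10),
}

# Fraction of attempted tags that were isolated, per fluorophore.
per_fluorophore = {name: hits / tried for name, (hits, tried) in recovered.items()}

# Overall recovery across all three fluorophores.
total_hits = sum(hits for hits, _ in recovered.values())
total_attempts = sum(tried for _, tried in recovered.values())
overall = total_hits / total_attempts

print(per_fluorophore, total_hits, total_attempts, overall)
```

      The totals agree with the 24 of 30 tags mentioned in the reviewer's first recommendation.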

      (3) For pools that required re‑injection after initial failures, we would like to see a description of the specific changes that were made to the injection mixes or procedures (e.g., new repair template prep, different Cas9 reagent lot, guide redesign). This will be useful troubleshooting information for others.

      We re-made the exact same injection mix, but this time used a Nanodrop to ensure that the purity of the repair templates, as assessed by absorbance ratios (A260/230 and A260/280), was sufficient after each purification step. No other changes were made. This is now specified in the methods section in the following way:

      “For re-runs of pools 4, 6 and 10 which failed initially, we regenerated the repair templates and ensured that after each column purification, the A260/230 ratio of the purified DNA was ≥2.2 and A260/280 was 1.8 ± 0.05 when measured with a Nanodrop spectrophotometer.”
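      The purity criteria quoted above can be expressed as a simple acceptance check. The following is a minimal sketch, not the authors' procedure: the function name and the example readings are hypothetical, and only the two thresholds (A260/230 of at least 2.2, A260/280 within 0.05 of 1.8) come from the quoted Methods text.

```python
def passes_purity(a260_230: float, a260_280: float) -> bool:
    """Return True if both absorbance ratios meet the thresholds quoted in the Methods text."""
    ok_230 = a260_230 >= 2.2                 # A260/230 must be at least 2.2
    ok_280 = abs(a260_280 - 1.8) <= 0.05     # A260/280 must be 1.8 +/- 0.05
    return ok_230 and ok_280


# Hypothetical (A260/230, A260/280) readings taken after a column purification
readings = [(2.31, 1.82), (2.05, 1.81), (2.40, 1.90)]
results = [passes_purity(r230, r280) for r230, r280 in readings]
```

      In this sketch, a template prep is carried forward only if both ratios pass; a failure on either ratio would prompt another purification round.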

      (4) The authors state that the fluorophore sequences are codon-optimized for C. elegans. We suggest they provide the exact donor/tag sequences, specifically state whether the fluorophore sequences contain any synthetic/artificial introns, or whether other sequence modifications (e.g., silent PAM‑disrupting mutations) were included in the donor templates. 

      This information is provided in Supplementary Table 1.

      (5) Page 3: Include a reference for "The C. elegans genome encodes around 20,000 genes" 

      We added a reference to the most recent release of the genome (WS237, May 2013). Spieth et al., 2014.

    1. eLife Assessment

      This important paper substantially advances our understanding of how Molidustat may work, beyond its canonical role, by identifying its therapeutic targets in cancer. This study presents a compelling and well-structured investigation into the therapeutic vulnerabilities of APC-mutant colorectal cancer. This work will be of broad interest to the cancer community in studying small molecules and their therapeutic targets.

    2. Reviewer #1 (Public review):

      [Editor's note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      The authors aimed to uncover novel therapeutic vulnerabilities in APC-mutant colorectal cancer (CRC), which constitutes the majority of CRC cases. They hypothesized that modulating oxygen-sensing pathways (via PHD inhibition) could disrupt adaptive stress responses in these tumours.

      Strengths:

      The study employs a powerful, two-pronged approach to identify Molidustat's targets. By using both Thermal Proteome Profiling (TPP) and an orthogonal chemical proteomic competition assay, the authors provide compelling evidence that GSTP1 is a genuine, direct off-target, effectively addressing the common limitation of indirect effects in proteomic screens.

    3. Reviewer #2 (Public review):

      Summary:

      The authors aimed to determine Molidustat targets and the potential utility of these findings. They clearly demonstrate that Molidustat interferes with GSTP1 and some other proteins on top of PHD2. They also demonstrate that PHD2 deletion is not sufficient to recapitulate Molidustat effects in cells and proteomes. Finally, they demonstrate synthetic lethality in organoids for Molidustat and APC deletion.

      Strengths:

      The data on Molidustat proteomes, GSTP1 binding, inhibition and metabolic health of organoids is really clear. All biochemical, docking and omic data are really strong. The potential impact of these findings could be the use of Molidustat in APC null tumours and awareness of potential off-target effects.

    4. Reviewer #3 (Public review):

      In this paper, the authors revealed that Molidustat can induce a dose-dependent increase in Caspase-3/7 activity in the HT29 cell line, which is an APC-mutant colorectal cancer cell line. More importantly, they found that targeting PHD2 alone cannot cause cell death. By using thermal proteome profiling (TPP) and orthogonal chemical proteomic competition assays, they determined GSTP1 as a previously undiscovered off-target of Molidustat. They also revealed that combined PHD2 and GSTP1 loss leads to an increase in intracellular ROS and apoptosis. Moreover, they evaluated the effects of Molidustat in colonic organoids and showed that Molidustat has a high selectivity for colonic organoids with activated WNT signaling and/or KRAS pathway alterations, and this effect is not reproduced by hydroxylase inhibition alone, providing a new potential approach to targeting both PHD2 and GSTP1 for the treatment of APC-mutant CRC.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aimed to uncover novel therapeutic vulnerabilities in APC-mutant colorectal cancer (CRC), which constitutes the majority of CRC cases. They hypothesized that modulating oxygen-sensing pathways (via PHD inhibition) could disrupt adaptive stress responses in these tumours.

      Strengths:

      The study employs a powerful, two-pronged approach to identify Molidustat's targets. By using both Thermal Proteome Profiling (TPP) and an orthogonal chemical proteomic competition assay, the authors provide compelling evidence that GSTP1 is a genuine, direct off-target, effectively addressing the common limitation of indirect effects in proteomic screens.

      Weaknesses:

      (1) In Figure 1, the current data rely on a single guide RNA (sgRNA). To make the data solid, at least two independent sgRNAs targeting different regions of PHD2 should be used.

      We thank the reviewer for raising this. Clarity on the CRISPR strategy was missing from the original submission and we have now added the following to the Methods (Page 4). We did not use a single sgRNA. PHD2 was targeted with a pool of three chemically modified crRNAs:

      (IDT Alt-R; target sequences: 5'-TACAACCAGCATATGCTACA, 5'-GTGGCTGCCGAAGCCGAGCC, 5'-GATAAGATCACCTGGATCGA)

      The crRNAs were delivered as in vitro assembled ribonucleoprotein complexes with high-fidelity Cas9. This format has been reported to achieve high on-target efficiency while minimising off-target cutting [1,2], such that any residual stochastic off-target events are distributed across the population and are not expected to manifest as a coherent phenotype at the population level. Working with pooled, unselected knockouts rather than single-cell clones also avoids the confounds of clonal heterogeneity that normally motivate the use of multiple independent guides and rescue experiments in single-clone workflows. We have previously validated this approach for GSTP1 knockout in a separate single-cell proteomics study [3], where loss of GSTP1 protein was observed in over 90% of single cells and GSTP1 was the most significantly altered protein between sgControl and sgGSTP1 populations.
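      As a quick sanity check on the three target sequences listed above, the sketch below confirms that each contains only A, C, G and T and reports its length and GC fraction. It is illustrative only: the helper names are hypothetical, and no design criterion (for example an acceptable GC range) is implied by the response.

```python
# Target sequences as quoted in the response (5' to 3')
TARGETS = [
    "TACAACCAGCATATGCTACA",
    "GTGGCTGCCGAAGCCGAGCC",
    "GATAAGATCACCTGGATCGA",
]


def is_valid_dna(seq: str) -> bool:
    """True if the sequence is non-empty and uses only the four DNA bases."""
    return len(seq) > 0 and set(seq.upper()) <= set("ACGT")


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# One (sequence, length, GC fraction) tuple per valid target
summary = [(s, len(s), round(gc_fraction(s), 2)) for s in TARGETS if is_valid_dna(s)]
```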

      (2) Figure 3E: The Asn205 site should be mutated to test whether Molidustat inhibits GSTP1 activity via Asn205.

      This is a good suggestion, and we explored it in silico before concluding it was not tractable. We used PyMOL mutagenesis to model Molidustat binding to GSTP1 variants at the predicted contact residues: Asn205 was mutated to Ala, Gly and Ser; Trp39 (predicted to hydrogen-bond Molidustat) was mutated to Ala, Phe and Thr; and a Tyr8Phe/Asn205Ser double mutant was also modelled. In every case, Molidustat reoriented within the active site and adopted an alternative hydrogen-bonding configuration (most commonly with Tyr8), yielding a docking score equal to or better than binding to native GSTP1 (Author response images 1 to 4). The model therefore does not predict any single or double point mutant that would ablate Molidustat binding in a clean, interpretable way, and we could not design a rational loss-of-interaction mutant on this basis. Given this limitation, and that definitive mapping of the binding interface would require co-crystallography, which is beyond the scope of the present study, we have moved the docking model to the supplement and flagged it as predictive rather than definitive.

      Author response image 1.

      Molidustat in native GSTP1

      Author response image 2.

      Molidustat docking with mutated GSTP1, Asn205 mutated to Gln205

      Author response image 3.

      Molidustat docking with mutated GSTP1, Tyr39 mutated to Phe39

      Author response image 4.

      Molidustat docking with mutated GSTP1, Asn205 mutated to Ser205 and Tyr8 mutated to Phe8

      (3) Figure 5B and 5C: The metabolic imbalance phenotype observed upon dual knockout of PHD2 and GSTP1 requires rescue experiments to confirm on-target specificity.

      We thank the reviewer for this important point and agree that rescue experiments could represent the most direct demonstration of on-target specificity for the metabolic phenotype observed in Figures 5B and 5C. These rescue experiments are necessary when working with single clones, as they allow for comparing a knock-out clone with a reconstituted pool and sidestep the issue of clonal heterogeneity.

      In our case, we think that there is no advantage to doing so, as we work with pooled knockouts, so any clonal heterogeneity is diluted in the pool.

      One could even make the case that such a rescue experiment would introduce additional artefacts. Combined loss of PHD2 and GSTP1 leads to reduced cellular viability, with decreased proliferation and increased apoptosis, consistent with a synthetic lethal interaction. To devise a rescue experiment, we would have to isolate a single-cell clone (the pool is not a complete knockout, so WT cells would outgrow the knockout cells). A clone that has overcome the anti-proliferative insult of the double knockout is likely to have a phenotype distinct from that of the original pooled population, just as the rescued cells would differ from WT cells. For these reasons, we have not performed rescue experiments in the current study. We have added the absence of a rescue as a limitation of the study in the Discussion:

      “While genetic rescue experiments would provide definitive confirmation of on-target specificity, the pronounced loss-of-fitness and apoptotic phenotype observed upon combined PHD2 and GSTP1 loss limited the feasibility of establishing stable rescued double-knockout populations, and therefore represents a limitation of the current study.”

      Reviewer #2 (Public review):

      Summary:

      The authors aimed to determine Molidustat targets and the potential utility of these findings. They clearly demonstrate that Molidustat interferes with GSTP1 and some other proteins on top of PHD2. They also demonstrate that PHD2 deletion is not sufficient to recapitulate Molidustat effects in cells and proteomes. Finally, they demonstrate synthetic lethality in organoids for Molidustat and APC deletion.

      Strengths:

      The data on Molidustat proteomes, GSTP1 binding, inhibition and metabolic health of organoids is really clear. All biochemical, docking and omic data are really strong. The potential impact of these findings could be the use of Molidustat in APC null tumours and awareness of potential off-target effects.

      Weaknesses:

      A main but minor weakness is that Molidustat also inhibits other PHDs, although these are less expressed. PHD1 has been shown to control the cell cycle and be expressed in the colon, where it is needed for viability. Although this does not explain the lack of effect of other PHD inhibitors, it does warrant some discussion. The use of MTT is not very good to detect viability when it measures metabolism; this also needs to be discussed and perhaps supplemented with colony or cell number measurements.

      Great point; for this reason, we have assayed apoptosis throughout. In addition, we have added a clonogenicity assay with APC organoids. Organoid cells were treated with an acute dose of Molidustat. We subsequently measured the level of Lgr5 (a stem cell marker) and the ability of the cells to generate organoids (these data have been added as Figure 5 F-G).

      Reviewer #3 (Public review):

      In this paper, the authors revealed that Molidustat can induce a dose-dependent increase in Caspase-3/7 activity in the HT29 cell line, which is an APC-mutant colorectal cancer cell line. More importantly, they found that targeting PHD2 alone cannot cause cell death. By using thermal proteome profiling (TPP) and orthogonal chemical proteomic competition assays, they determined GSTP1 as a previously undiscovered off-target of Molidustat. They also revealed that combined PHD2 and GSTP1 loss leads to an increase in intracellular ROS and apoptosis. Moreover, they evaluated the effects of Molidustat in colonic organoids and showed that Molidustat has a high selectivity for colonic organoids with activated WNT signaling and/or KRAS pathway alterations, and this effect is not reproduced by hydroxylase inhibition alone, providing a new potential approach to targeting both PHD2 and GSTP1 for the treatment of APC-mutant CRC.

      Specific comments:

      (1) What is the possible molecular mechanism by which dual GSTP1/PHD2 loss induces cell death?

      This is an important question. Our data support a model in which combined loss of GSTP1 and PHD2 disrupts cellular redox homeostasis, leading to accumulation of reactive oxygen species, increased GSSG/GSH ratios, and depletion of antioxidant buffering capacity. This redox imbalance is accompanied by downregulation of pro-survival pathways. In this context, activation of apoptotic signalling, as evidenced by increased caspase-3/7 activity and proteomic enrichment of apoptosis-associated pathways, contributes to the observed cell death phenotype.

      While apoptosis is supported by our data, the magnitude of oxidative stress suggests that additional oxidative stress-associated cell death mechanisms may also contribute. We have clarified this point in the Discussion (Page 11).

      (2) Can the authors mutate the binding site of Molidustat on GSTP1 to verify the in silico docking results?

      This is a very important question. Currently, the model is of limited value. Reviewer 1 had a similar question, and we refer the reviewer to our response to Reviewer 1, question 2.

      (3) Evidence for Molidustat inhibiting PHD2 activity or stabilising HIF-1α should be provided.

      We thank the reviewer for this suggestion. Data showing HIF-1α stabilisation and evidence of downstream signalling have now been added to Supplementary Figure 1.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I only have minor suggestions:

      Molidustat also inhibits other PHDs, although these are less expressed. PHD1 has been shown to control the cell cycle and be expressed in the colon, where it is needed for viability. Although this does not explain the lack of effect of other PHD inhibitors, it does warrant some discussion. The use of MTT is not very good to detect viability when it measures metabolism; this also needs to be discussed and perhaps supplemented with colony or cell number measurements.

      This is correct; PHD1 is of particular interest, given the effects that its inhibition/knock-out has on the inflamed colon. We have added a new paragraph to the Discussion (Page 13) that addresses the isoform selectivity of Molidustat. We note that, although developed as a PHD2 inhibitor, Molidustat retains appreciable activity against PHD1 and PHD3 [4], and we discuss the non-redundant and in some contexts opposing roles of PHD1 and PHD2 in the colon: PHD1 loss is protective in DSS colitis [5] and restrains colitis-associated tumour growth, whereas PHD2 loss in the tumour and stroma is reported to inhibit metastasis and improve treatment response [6]. We further note that this pattern of isoform engagement is shared with other pan-PHD inhibitors that did not phenocopy Molidustat in our screens, indicating that PHD isoform profile alone is insufficient to explain Molidustat’s distinctive activity and pointing to GSTP1 off-target engagement as the key distinguishing feature. We argue that localised colonic delivery (as discussed earlier in the Discussion) would concentrate drug at the APC-mutant epithelium while limiting systemic exposure.

      We fully agree with the reviewer that MTT measures metabolic activity/NADH levels rather than viability in the strict sense, and that this is particularly relevant for a compound that perturbs redox metabolism. We have added a clonogenicity assay in APC organoids (Fig. 5 F-G) to supplement the MTT and Cleaved Caspase 3 assays already present in the manuscript.

      (1) Lee, J. K. et al. Directed evolution of CRISPR-Cas9 to increase its specificity. Nat. Commun. 9, (2018).

      (2) Sakovina, L., Vokhtantsev, I., Vorobyeva, M., Vorobyev, P. & Novopashina, D. Improving Stability and Specificity of CRISPR/Cas9 System by Selective Modification of Guide RNAs with 2′-fluoro and Locked Nucleic Acid Nucleotides. Int. J. Mol. Sci. 23, (2022).

      (3) Makar, A. N., Holkham, J., Lilla, S., Wilkinson, S. & von Kriegsheim, A. Overcoming preservation challenges to enable single-cell proteomics of fixed cell and tissue samples with retained proteome integrity. Preprint at https://doi.org/10.1101/2025.03.10.642380 (2025).

      (4) Flamme, I. et al. Mimicking hypoxia to treat anemia: HIF-stabilizer BAY 85-3934 (molidustat) stimulates erythropoietin production without hypertensive effects. PLoS One 9, (2014).

      (5) Tambuwala, M. M. et al. Loss of prolyl hydroxylase-1 protects against colitis through reduced epithelial cell apoptosis and increased barrier function. Gastroenterology 139, (2010).

      (6) Leite de Oliveira, R. et al. Gene-Targeting of Phd2 Improves Tumor Response to Chemotherapy and Prevents Side-Toxicity. Cancer Cell 22, (2012).

    1. eLife Assessment

      This study used pupillometry to provide an objective assessment of a form of synesthesia in which people see additional color when reading numbers. It provides convincing evidence that subjective color ratings are matched by changes in pupil size that recapitulate brightness-mediated changes when exposed to the real color. The work provides a valuable contribution to the literature on both synesthetic perception and the use of pupillometry to probe perception and related psychological processes.

    2. Reviewer #1 (Public review):

      Summary:

      Knowing that small pupil-size variations accompany brightness variations (even when these are illusory), the authors asked whether pupil constrictions would accompany the synesthetic perception of a brighter color (compared with a darker one), induced by the presentation of a black-white character. This grapheme-colour synesthesia is only experienced by few participants, sixteen of whom were enrolled in this study. The results reliably showed that a relative pupil constriction would "betray" the perception of a brighter color in these participants, while no such effect would be observed in control participants who were asked to report a color in association with each grapheme, even though they did not perceive any.

      Strengths:

      The main strength of the study lies in its combination of psychophysics (brightness ratings) and pupillometry, which allowed for showing clear-cut results.

      Weaknesses:

      I only see the following relatively minor weaknesses, namely:

      - The pupil traces in Figure 3 (main results) are heavily pre-processed (per-participant demeaned), losing any feature besides the effect of interest. As I argued in my first review, I worry that this format gives unrealistic expectations about the effect (the perception of dark/bright colors does not generate a net dilation/constriction of the pupil; perception-related modulations of pupil size are always relative and generally small compared to the numerous other effects registered in pupil size; these include a pupil dilation that is more prominent in the controls and that gets analyzed later on in the manuscript; I do not think that eliminating one of the effects of interest from a main results figure helps the reader understand the results). In the revised manuscript, the authors addressed this concern by adding a Supplementary Figure 4, where a more complete representation of the results is shown (traces from individual trials are baseline corrected and averaged, resulting in more informative timecourses). I would strongly recommend that Supplementary Figure 4 is brought to the main text (Figure 3 could be presented in Supplementary).

      - Responses to physical brightness modulations were only measured in the synesthetes group, not in controls. The authors point out that pupillary light responses have been thoroughly characterized in previous studies, and conclude that synesthetes' responses were in line with the expectations both in terms of amplitude and latency. However, as we are not dealing with standardized measurements, subtle differences in pupil reactivity across the two populations remain a possibility. I recommend that this possibility is mentioned in the discussion.

      Impact:

      This work is likely to improve our understanding of synesthesia, providing a new tool to quantify the subjective sensations; an interesting potential extension would be using pupillometry for tracking changes over time of the synesthetic experiences, opening up the possibility to evaluate the importance of learning for this peculiar experience.

    3. Reviewer #2 (Public review):

      Synesthesia is a neurological condition where stimulation of one sensory channel leads to involuntary, automatic, and consistent experience of another, unrelated percept. For example, Sir Francis Galton (1880, Nature) famously described the robust tendency of some individuals (synesthetes) to associate numerals with a distinct color. Ever since, synesthesia has continued to attract broad interest in the cognitive neurosciences in light of its implications for the study of domains such as perception, consciousness, and brain connectivity, among others.

      Strauch, Leenaars, and Rouw measured pupil size in a group of 16 grapheme-color synesthetes and two matched control groups. The participants were presented with gray digits - that is, visual stimuli having identical physical properties in terms of brightness. Each participant subsequently rated the corresponding evoked color and brightness: unlike controls, synesthetes did so in a very consistent and reliable fashion. Accordingly, this was also shown in their pupils: despite the same objective luminance, digits associated with brighter percepts caused their pupils to constrict and digits associated with darker percepts caused their pupils to dilate more than controls. These results highlight how crossmodal correspondences are deeply rooted in synesthetes, and put forward pupillometry as a particularly appealing biomarker for some phenomenological experiences (at least those grounded in "brightness").

      Further strengths of the technique are its temporal resolution and its responsiveness to several constructs. Across several tasks, the authors show for example that responses to synesthetic light are somewhat slower than responses to real light (i.e., they are likely mediated), but at the same time faster than responses to mental imagery. The role of mental imagery can also be reasonably dismissed when considering the second feature of pupil size: its responsiveness to mental effort and cognitive load. The pupils tend to dilate with demanding, challenging tasks, and this was the case when control participants were asked to report the color of a digit for which they did not consistently experience a synesthetic association. The same task was, instead, seemingly effortless for synesthetes, again speaking in favor of the automaticity of number-color correspondences in their case.

      Overall, the findings by Strauch, Leenaars, and Rouw are highly significant for the field and likely to be impactful. The strength of their evidence, when accounting for the relatively small sample size and the inherent variability of both phenomenology (color perception and subjective reporting) and physiology (pupil size), is adequate and sufficiently convincing.

      Comments on revisions:

      I thank the authors for addressing all my comments in a satisfactory way. I think that the paper has improved, especially in terms of transparency of the reporting and clarity of the results.

    4. Reviewer #3 (Public review):

      Summary:

      In the present study, the authors examined pupillary responses to uncolored stimuli (number graphemes) among number-color synesthetes and non-synesthetes. After seeing a digit, the synesthetes and active control participants were asked to indicate which color they perceived using three dimensions of hue, saturation, and lightness. The lightness values were the primary independent variable for follow-up analyses. To see how the pupil responded to psychologically "bright" and "dark" digits, the authors split the reported lightness values at the median and plotted them. The synesthetes showed a pupillary constriction to digits they perceived as bright and dilation to digits they perceived as dark. Active control participants did not show that effect. In a subsequent block, only the synesthetes were shown the colors they reported perceiving as colored discs. Their pupillary responses were similar. The authors also found that the differences in pupillary responses between light and dark perceptions (with digits) were only slightly delayed in their onset to the perception of a colored disc, and therefore the color perception accompanying a digit is unlikely to be effortful or a retrieved association, but occurs rather automatically.
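      The median split described in this summary can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the function name and the example ratings are hypothetical, and assigning ratings equal to the median to the "dark" bin is an arbitrary choice made here for simplicity.

```python
from statistics import median


def median_split(lightness_by_digit: dict) -> dict:
    """Split digits into 'bright' and 'dark' bins at the median reported lightness."""
    cut = median(lightness_by_digit.values())
    # Assumption: ratings exactly at the median are assigned to the dark bin
    bright = sorted(d for d, v in lightness_by_digit.items() if v > cut)
    dark = sorted(d for d, v in lightness_by_digit.items() if v <= cut)
    return {"cut": cut, "bright": bright, "dark": dark}


# Hypothetical lightness ratings (0-100) that one participant might report per digit
ratings = {0: 90, 1: 15, 2: 70, 3: 40, 4: 55, 5: 20}
split = median_split(ratings)
```

      Pupil traces would then be averaged separately for the digits in each bin; computing the split per participant respects the idiosyncratic digit-color mappings described above.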

      Strengths:

      The authors employed a well-controlled and designed quasi-experiment comparing color-grapheme synesthetes to non-synesthetes and showed convincingly that the color perceptions accompanying graphemes alter the physical perception of brightness. They also made a reasoned attempt to rule out the possibility that color associations occur effortfully via retrieved associations.

      The following are questions which I had asked in a first round of reviews, and which were answered adequately by the authors:

      (1) Are the pupillary responses among synesthetes, which objectively do not seem to match the degree of physical stimulation entering the retina, in any way maladaptive for eye functioning? I understand the constriction/dilation of the pupil to not only benefit visual acuity but also to protect the retina from damage. Are synesthetes at any risk of retinal damage due to over-dilation of the pupil to brighter stimuli? Or are these effects of a magnitude that is too small to matter? As reported in arbitrary units, it was hard to know how large these effects were in terms of measurable changes in dilation (e.g., millimeters).

      (2) Likewise, is the automatic synesthetic merging of two percepts something that could be learned such that natural synesthetes and "artificial" synesthetes would look similar? For example, if a group of non-synesthetic participants were to learn a color-grapheme association to automaticity, would you expect their pupillary responses to the graphemes to look similar to those of the synesthetes? If so (or if not), what would this tell us about the phenomenology of synesthesia?

      (3) Do the synesthetic perceptions of digit graphemes merge in a sensible way? For example, if a synesthete sees a particular color with the digit 1, and a different color with the digit 9, what do they perceive when they see 19? or 1-9, or 1 9? Is there color blending, or an altogether different color perception?

    5. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study used pupillometry to provide an objective assessment of a form of synesthesia in which people see additional color when reading numbers. It provides convincing evidence that subjective color ratings are matched by changes in pupil size that recapitulate brightness-mediated changes when exposed to the real color. The work provides a valuable contribution to the literature on both synesthetic perception and the use of pupillometry to probe perception and related psychological processes.

      We were pleased to learn that our manuscript was of interest to the reviewers and the editor. We thank the reviewers for their useful feedback and have addressed all their comments in the revised version. We here give the most prominent changes as quotes.

      We thank all reviewers for their very helpful input.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Knowing that small pupil-size variations accompany brightness variations (even when these are illusory), the authors asked whether pupil constrictions would accompany the synesthetic perception of a brighter color (compared with a darker one), induced by the presentation of a black-white character. This grapheme-colour synesthesia is only experienced by a few participants, sixteen of whom were enrolled in this study. The results reliably showed that a relative pupil constriction would "betray" the perception of a brighter color in these participants, while no such effect would be observed in control participants who were asked to report a color in association with each grapheme, even though they did not perceive any.

      Strengths:

      The main strength of the study lies in its combination of psychophysics (brightness ratings) and pupillometry, which allowed for showing clear-cut results.

      Weaknesses:

      Some relatively minor weaknesses concern the ancillary analyses, which tackle secondary questions and are not entirely convincing.

      (1) The linear mixed model approach is a powerful way to identify important variables, but it does not clarify whether the key factors are between-subject or between-trial variations. Some variables are inherently defined at a subject level (e.g., PA scores), others are not. I would strongly recommend an alternative visualisation of the results to examine inter-individual variability.

      Visualizing the highly idiosyncratic effects is indeed challenging. Addressing R1’s point 4 and a point brought up by R2, we updated all figures to now visualize pupil size in millimeters instead of arbitrary units. Furthermore, we added a supplementary figure (supplementary figure 4) that visualizes pupil size change without demeaning (please see reply to point 4).

      To get a better grasp of the interaction between lightness and coupling strength, we further included Supplementary Figure 5, which splits the data by lightness and coupling strength in synesthetes.

      Furthermore, as this review and response will be publicly available, Author response image 1 provides participant-mean traces per lightness bin in addition to the overall means and hopefully makes the stability/variability of effects visually clearer (in addition to the strip plots that attempt this for the average response).

      Author response image 1.

      We hope that these additional visualizations make the effects of interest more transparent. Ultimately, however, the LME figure likely provides the information best, albeit at the cost of complexity.

      (2) It is not clear why taking the first derivative of pupil size in Figure 5 would isolate the effect of arousal, eliminating those of luminance and contrast changes (in fact, one could argue for the opposite, since arousal effects are generally constant for extended periods of time while contrast effects are typically more local and transient).

      First, please note that the results in 2.3.1 cannot be explained by task or context effects such as luminance and contrast: the exact same active color reporting task (same task and context) was presented to synesthetes and non-synesthetes.

      Indeed, the reviewer is correct that the first derivative does not eliminate other concurrent pupil-driving effects; this was expressed wrongly in our original text. Any stimulus-locked effect, such as the luminance and contrast effects, but also the effort effect, will be reflected similarly in the derivative measure.

      We took the derivative because pupil responses driven by non-trial-related activity, such as increasing tiredness or excitement over the course of trials, differ almost by necessity between participants, thus creating variability. However, these effects most likely unfold at a slower timescale and thus show less in the derivative measure. Accordingly, we previously found clearer response-locked effects when using a derivative measure (Douze et al., 2025; Ten Brink et al., 2024). In this way, we also hoped to reduce such between-participant variability for this between-participant analysis.

      Even if we were to use a baseline-corrected analysis instead, we would arrive at the same conclusion. To test the same question without relying on the derivative, we directly compared baseline-corrected pupil sizes while taking individual differences into account, using over-time LMEs. Group (active control vs. synesthete) gained significance between ~1.7 s and 3 s, aligning with the derivative-based result.

      Author response image 2.

      t-values of a per-timepoint LME predicting pupil response from group (synesthete/active control). Group reached significance.

      In sum, we deem the derivative more powerful/more appropriate in this context, but the interpretation of findings does not hinge on that analysis choice (as can be seen in the Author response image 2).

      We corrected the claim that the derivative is a measure that cleans out other effects, which was indeed oversimplified as it stood. We now write:

      “Mental effort presents in task-evoked pupil dilations, yet other factors simultaneously affect the pupil, such as luminance and contrast changes at trial onset, as well as slower trends across the session (e.g., fatigue). To reduce the influence of these slower, non-trial-locked fluctuations while retaining the trial-evoked dynamics, we calculated the first derivative of the pupil time course to assess the velocity of pupillary changes (Butterworth filter, 18 Hz, order 3, 2.5 Hz lowpass, following our previous works [60, 61]).”
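
      To illustrate the reasoning behind this choice, the following minimal Python sketch (not the analysis code used for the manuscript) shows how differentiating a trace removes a participant-specific offset and reduces a slow linear drift to a small constant, while the trial-evoked change remains clearly visible. The sampling rate, trace shapes, and magnitudes are hypothetical, and a simple moving average stands in for the Butterworth low-pass filter named in the quoted text.

```python
import math


def moving_average(x, width):
    """Centered moving average; the window shrinks at the edges.

    Stand-in smoother only: the quoted text names a Butterworth low-pass filter.
    """
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def velocity(x, fs):
    """First derivative (units per second) via central differences."""
    n = len(x)
    v = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        v.append((x[hi] - x[lo]) * fs / (hi - lo))
    return v


fs = 100  # Hz, hypothetical sampling rate
t = [i / fs for i in range(4 * fs)]  # one hypothetical 4 s trial

# Hypothetical trial-evoked response (mm), identical for both participants.
evoked = [0.2 * math.exp(-((ti - 1.5) ** 2) / (2 * 0.3 ** 2)) for ti in t]

# Participant A: baseline of 4 mm. Participant B: baseline of 5 mm plus a
# slow drift of 0.02 mm/s (e.g., fatigue). Both values are made up.
trace_a = [4.0 + e for e in evoked]
trace_b = [5.0 + 0.02 * ti + e for ti, e in zip(t, evoked)]

vel_a = velocity(moving_average(trace_a, 5), fs)
vel_b = velocity(moving_average(trace_b, 5), fs)

min_raw_diff = min(abs(b - a) for a, b in zip(trace_a, trace_b))
max_vel_diff = max(abs(b - a) for a, b in zip(vel_a, vel_b))
peak_evoked_velocity = max(abs(v) for v in vel_a)

print(f"raw traces differ by at least {min_raw_diff:.2f} mm")
print(f"velocities differ by at most {max_vel_diff:.3f} mm/s")
print(f"peak evoked velocity is {peak_evoked_velocity:.3f} mm/s")
```

      In this toy case the raw traces never agree, whereas the velocity traces differ only by the drift rate, which is small relative to the evoked velocity. Stimulus-locked effects such as luminance or contrast changes would, as noted above, survive differentiation just like the evoked response.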

      Douze, B. T., Ten Brink, A. F., Dijkerman, H. C., & Strauch, C. (2025). Pupil responses objectively index pharmacologically altered tactile sensitivity. Cortex, 193, 90-104.

      Ten Brink, A. F., Heiner, I., Dijkerman, H. C., & Strauch, C. (2024). Pupil dilation reveals the intensity of touch. Psychophysiology, 61(6), e14538.

      (3) It is a pity that responses to physical brightness modulations were only measured in the synesthete group, not in controls, as this would have allowed for ruling out differences in pupil reactivity across the two populations.

      The reviewer is correct that this would allow additional comparisons, but we argue that light responses in healthy control samples are very well documented and stereotypical. For instance, Bergamin & Kardon (2003) provide very systematic latency estimates: about 320 ms for stimuli with low luminance change, accelerating to about 250 ms for very strong luminance changes. Responses to our relatively small luminance increments should thus be expected in this range. Indeed, this also describes well the response latencies we observed in synesthetes when exposed to the colored disks. While there is no detailed information about participants in Bergamin & Kardon (2003), data from previous studies show very similar pupil light response profiles in a healthy student control population that matches our synesthetes well demographically (Strauch, Romein et al., 2022, Figure 2a, exact same lab as for the present study; Koevoet et al., 2025, Figure 3a). See also our further responses: baseline pupil size in millimeters did not differ across groups.

      Together, we can safely conclude that pupil light responses in synesthetes are not different from pupil light responses in controls. We agree with the reviewer that this is a sensible point to also make in the manuscript:

      “Specifically, pupil size first responded significantly to physical luminance after 330 ms (see Supplementary Figure 7 for per-timepoint LME; in line with response latencies of similar control populations, see Bergamin & Kardon [52], Koevoet et al. [40], and Strauch et al. [53]), but only responded significantly to synesthetic lightness at about 870 ms (see also Figure 3c vs e and Figure 4 for per-timepoint LME)”.

      Bergamin, O., & Kardon, R. H. (2003). Latency of the pupil light reflex: sample rate, stimulus intensity, and variation in normal subjects. Investigative Ophthalmology & Visual Science, 44(4), 1546-1554.

      Koevoet, D., Naber, M., Strauch, C. & Van der Stigchel, S. Presaccadic Attention Shifts Up-and Downwards: Evidence From the Pupil Light Response. Psychophysiology 62, e70047 (2025).

      Strauch, C., Romein, C., Naber, M., Van der Stigchel, S., & Ten Brink, A. F. (2022). The orienting response drives pseudoneglect—Evidence from an objective pupillometric method. Cortex, 151, 259-271.

      (4) Another concern is with the visualisation of the pupil traces in Figure 3 (main results); these were heavily pre-processed (per-participant demeaned), losing any feature besides the effect of interest and generating the unrealistic expectation that perception of dark/bright colors generate a net dilation/constriction of the pupil - whereas perception-related modulations of pupil size are always relative and generally small compared to the numerous other effects registered in pupil size. It would be far better to see the actual profiles, preserving the unfolding of dilations and constrictions over time, especially since these are further analysed in Figures 4 and 5.

      Indeed, the expectation that any dark synesthetic experience would lead to pupil dilation whereas any bright synesthetic experience would lead to constriction is not warranted – it would only do that relative to the counterfactual of not having that experience.

      Many factors affect the pupillary signal at the same time, and often differently across individuals (think of tiredness etc.), making merely baseline corrected traces seemingly noisy. Our visualization highlights that there is a systematic part to that variation that lies in the synesthetic brightness experience.

      Visualizing the effects of idiosyncratic experiences that vary within and between participants is challenging. For the theoretical insight brought about through our paper in Figure 4 (synesthesia being sensory in nature), demeaning is favorable in our opinion as it isolates the effect of interest in visualization. However, for methodological reasons and to better show effect sizes etc., there is certainly use in additional transparency. We thus now provide non-demeaned traces in the supplementary material as the reviewer suggested and also refer to these in the main manuscript. Furthermore, all figures are now provided in millimeters, with all pupil-related analyses being rerun and updated to this end (without qualitative changes to the results). This should further rectify possibly inflated expectations about the absolute size of effects and allows effects to be put into perspective across studies. We now added:

      “Pupillary data were transformed from arbitrary eyelink units to millimeters using a conversion factor obtained with an artificial eye (see Hayes & Petrov, 2016).”

      Hayes, T. R., & Petrov, A. A. (2016). Mapping and correcting the influence of gaze position on pupil size measurements. Behavior research methods, 48(2), 510-527.
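
      As a rough illustration of what such a conversion involves, the sketch below calibrates a scaling factor from an artificial pupil of known diameter and applies it to recorded values. It is not the authors' code: the function names, the calibration numbers, and the choice between a linear mapping (tracker reports diameter) and a square-root mapping (tracker reports area) are all assumptions made for the example.

```python
import math


def calibrate(known_mm, measured_units, mode="diameter"):
    """Return a units-to-mm factor from one artificial-eye measurement.

    mode="diameter": tracker units are assumed proportional to diameter.
    mode="area": tracker units are assumed proportional to pupil area,
    so diameter scales with the square root of the reported value.
    """
    if mode not in ("diameter", "area"):
        raise ValueError("mode must be 'diameter' or 'area'")
    reference = measured_units if mode == "diameter" else math.sqrt(measured_units)
    return known_mm / reference


def to_mm(units, factor, mode="diameter"):
    """Convert one tracker value to a pupil diameter in millimeters."""
    value = units if mode == "diameter" else math.sqrt(units)
    return factor * value


# Hypothetical calibration: a 4 mm artificial pupil read as 2000 units.
factor_diameter = calibrate(4.0, 2000.0, mode="diameter")
sample_mm = to_mm(1500.0, factor_diameter, mode="diameter")

# Hypothetical calibration in area mode: the same pupil read as 1600 units.
factor_area = calibrate(4.0, 1600.0, mode="area")
sample_area_mm = to_mm(900.0, factor_area, mode="area")

print(round(sample_mm, 3), round(sample_area_mm, 3))
```

      A single factor of this kind is only valid for a fixed camera setup and viewing distance, so it would need to be re-measured if the geometry changes.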

      Impact:

      Despite these weaknesses, and especially if they are adequately addressed in the review, this work is likely to improve our understanding of synesthesia, providing a new tool to quantify the subjective sensations; an interesting potential extension would be using pupillometry for tracking changes over time of the synesthetic experiences, opening up the possibility to evaluate the importance of learning for this peculiar experience.

      We were happy to read our manuscript was evaluated this positively and hope that our replies can address the remaining smaller concerns and make findings more transparent to the readers.

      Reviewer #2 (Public review):

      Synesthesia is a neurological condition where stimulation of one sensory channel leads to involuntary, automatic, and consistent experience of another, unrelated percept. For example, Sir Francis Galton (1880, Nature) famously described the robust tendency of some individuals (synesthetes) to associate numerals with a distinct color. Ever since, synesthesia has continued to attract a broad interest in the cognitive neurosciences in light of its implications for the study of domains such as perception, consciousness, and brain connectivity, among others.

      Strauch, Leenaars, and Rouw measured pupil size in a group of 16 grapheme-color synesthetes and two matched control groups. The participants were presented with gray digits - that is, visual stimuli having identical physical properties in terms of brightness. Each participant subsequently rated the corresponding evoked color and brightness: unlike controls, synesthetes did so in a very consistent and reliable fashion. Accordingly, this was also shown in their pupils: despite the same objective luminance, digits associated with brighter percepts caused their pupils to constrict, and digits associated with darker percepts caused their pupils to dilate more than controls. These results highlight how crossmodal correspondences are deeply rooted in synesthetes, and put forward pupillometry as a particularly appealing biomarker for some phenomenological experience (at least those grounded in "brightness").

      Further strengths of the technique are its temporal resolution and its responsiveness to several constructs. Across several tasks, the authors show, for example, that responses to synesthetic light are somewhat slower than responses to real light (i.e., they are likely mediated), but at the same time faster than responses to mental imagery. The role of mental imagery can also be reasonably dismissed when considering the second feature of pupil size: its responsiveness to mental effort and cognitive load. The pupils tend to dilate with demanding, challenging tasks, and this was the case when control participants were asked to report the color of a digit for which they did not consistently experience a synesthetic association. The same task was, instead, seemingly effortless for synesthetes, again speaking in favor of the automaticity of number-color correspondences in their case.

      Overall, the findings by Strauch, Leenaars, and Rouw are highly significant for the field and likely to be impactful. The strength of their evidence, when accounting for the relatively small sample size and the inherent variability of both phenomenology (color perception and subjective reporting) and physiology (pupil size), is adequate and sufficiently convincing.

      We were glad to read this overall very positive assessment of our work and thank the reviewer for the additional non-public suggestions for improvements.

      Reviewer #3 (Public review):

      Summary:

      In the present study, the authors examined pupillary responses to uncolored stimuli (number graphemes) among number-color synesthetes and non-synesthetes. After seeing a digit, the synesthetes and active control participants were asked to indicate which color they perceived using three dimensions of hue, saturation, and lightness. The lightness values were the primary independent variable for follow-up analyses. To see how the pupil responded to psychologically "bright" and "dark" digits, the authors split the reported lightness values at the median and plotted them. The synesthetes showed a pupillary constriction to digits they perceived as bright and dilation to digits they perceived as dark. Active control participants did not show that effect. In a subsequent block, only the synesthetes were shown the colors they reported perceiving as colored discs. Their pupillary responses were similar. The authors also found that the differences in pupillary responses between light and dark perceptions (with digits) were only slightly delayed in their onset to the perception of a colored disc, and therefore, the color perception accompanying a digit is unlikely to be effortful or a retrieved association, but occurs rather automatically.

      Strengths:

      The authors employed a well-controlled and designed quasi-experiment comparing color-grapheme synesthetes to non-synesthetes and showed convincingly that the color perceptions accompanying graphemes alter the physical perception of brightness. They also made a reasoned attempt to rule out the possibility that color associations are occurring effortfully via retrieved associations.

      We appreciate the positive assessment and useful suggestions for revision.

      Weaknesses:

      There are some areas in which the implications of these findings could be elaborated upon. I had the following questions:

      (1) Are the pupillary responses among synesthetes, which objectively do not seem to match the degree of physical stimulation entering the retina, in any way maladaptive for eye functioning? I understand the constriction/dilation of the pupil to not only benefit visual acuity but also to protect the retina from damage. Are synesthetes at any risk of retinal damage due to over-dilation of the pupil to brighter stimuli? Or are these effects of a magnitude that is too small to matter? As reported in arbitrary units, it was hard to know how large these effects were in terms of measurable changes in dilation (e.g., millimeters).

      This is an interesting point. Some argue that pupil size changes in a mid-range mildly affect optics thus affecting detection performance, contrast perception, and depth of field (Eberhardt et al., 2022, Mathôt & Ivanov 2019, Ruuskanen, Boehler, & Mathôt, 2025), rather than serving a protective role for the retina (Mathôt, 2018). Indeed, any effects reported here were quite small. We agree with the reviewer that this can be made more accessible by reporting effects in millimeters. We thus now adjusted all figures accordingly and write in the methods section:

      “Pupillary data were transformed from arbitrary eyelink units to millimeters using a conversion factor obtained with an artificial eye (see Hayes & Petrov, 2016).”

      Note that even the largest effects here (those elicited by physical luminance change in block 2 for the synesthetes) only caused differences in pupil size of about 0.3 mm. This lies below the maximal pupil dilations observable in response to maximal effort (about 0.5 mm), for instance, and substantially below the full range of pupil size changes elicited through strong luminance stimulation (several millimeters). We therefore deem the changes in pupil size obtained in our study too minor to be practically maladaptive for optics/perception.

      Eberhardt, L. V., Strauch, C., Hartmann, T. S., & Huckauf, A. (2022). Increasing pupil size is associated with improved detection performance in the periphery. Attention, perception, & psychophysics, 84(1), 138-149.

      Hayes, T. R., & Petrov, A. A. (2016). Mapping and correcting the influence of gaze position on pupil size measurements. Behavior research methods, 48(2), 510-527.

      Mathôt, S., & Ivanov, Y. (2019). The effect of pupil size and peripheral brightness on detection and discrimination performance. PeerJ, 7, e8220.

      Mathôt, S. (2018). Pupillometry: Psychology, physiology, and function. Journal of cognition, 1(1), 16.

      Ruuskanen, V., Boehler, C. N., & Mathôt, S. (2025). The Interplay of Spontaneous Pupil-Size Fluctuations and EEG Power in Near-Threshold Detection. Psychophysiology, 62(3), e70035.

      (2) Likewise, is the automatic synesthetic merging of two percepts something that could be learned such that natural synesthetes and "artificial" synesthetes would look similar? For example, if a group of non-synesthetic participants were to learn a color-grapheme association to automaticity, would you expect their pupillary responses to the graphemes to look similar to the synesthetes'? If so (or if not), what would this tell us about the phenomenology of synesthesia?

      We find this question most interesting. Likely, different synesthesia researchers wouldn’t even fully agree on the most plausible answers to these questions. Training studies have shown that non-synesthetes can be trained to associate particular colors with particular graphemes, as revealed in the synesthetic Stroop effect: interference effects of the learned color onto reporting the typeface color of the grapheme. The degree to which non-synesthetes can be trained to become similar to synesthetes is, however, still a topic of debate.

      We now discuss as follows:

      “Future studies could examine to what degree training a non-synesthete to associate specific colors to particular inducers (e.g., digits), can provide similar patterns of results as genuine synesthesia (Bor et al., 2014, Colizoli et al., 2012, Rothen & Meier, 2014). Could learning produce similar brightness-related pupil effects in non-synesthetes? Similarly, would effort-linked responses diminish with increased training duration? The perhaps most interesting question relates to response latencies: Would a trained participant ever be able to produce brightness-related pupil effects as fast as a synesthete?”

      Bor, D., Rothen, N., Schwartzman, D. J., Clayton, S., & Seth, A. K. (2014). Adults can be trained to acquire synesthetic experiences. Scientific reports, 4(1), 7089.

      Colizoli, O., Murre, J. M., & Rouw, R. (2012). Pseudo-synesthesia through reading books with colored letters. PloS one, 7(6), e39799.

      Rothen, N., & Meier, B. (2014). Acquiring synaesthesia: insights from training studies. Frontiers in human neuroscience, 8, 109.

      (3) Do the synesthetic perceptions of digit graphemes merge in a sensible way? For example, if a synesthete sees a particular color with the digit 1, and a different color with the digit 9, what do they perceive when they see 19? or 1-9, or 1 9? Is there color blending, or an altogether different color perception?

      This is a very interesting question indeed. While each synesthete will have their own specific expression of synesthesia, there are regularities in how a combination of digits evokes synesthetic color. First, if asked about the color of a specific digit, each digit keeps its own color, as the color of a digit is linked to the identity of the digit (Dixon et al., 2006). Context effects are however possible, in particular when context alters the interpretation of the digit (Myles et al., 2003). A particularly common context in a multi-digit number is a dominant first digit, spreading its color to the subsequent digits in the number. However, as the digit color is linked to digit identity, what does ‘not’ happen is a mixing of colors into a qualitatively new color; for example, a yellow "1" and blue "9" do not merge into a green "19".

      Dixon, M. J., Smilek, D., Duffy, P. L., Zanna, M. P., & Merikle, P. M. (2006). The role of meaning in grapheme-colour synaesthesia. Cortex, 42(2), 243-252.

      Myles, K. M., Dixon, M. J., Smilek, D., & Merikle, P. M. (2003). Seeing double: The role of meaning in alphanumeric-colour synaesthesia. Brain and Cognition, 53(2), 342-345.

      Many thanks for the constructive assessment of our work.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I am not sure I'd use the term 'cross-modal' given that the case considered here (grapheme-color) is purely visual.

      The reviewer is absolutely right: the term 'cross-modal' has a historical background rather than being strictly accurate. The term is still commonly used, however, as it readily reflects how the induced additional experience is always of a different (sub)type than the inducing experience. There is a cross-over between experiences that might occur within the same sensory modality, or even induce awareness of a particular concept. But key to synesthesia is the cross-over experience, as the inducer and concurrent are different (sub)types of experiences. For example, seeing a letter can evoke a synesthetic experience of seeing a color, or evoke awareness of a particular gender or personality of that letter, but does not evoke another letter. To remain consistent with the literature, we refer to 'cross-modality' when explaining the link to previous literature, but generally switched to using 'cross-over experience':

      “Therefore, synesthesia might provide a unique window into how the brain’s constructive processes can generate additional, conscious content, in cross-over experiences, often across modalities, going all the way down to the level of sensory phenomenology.”

      We adjusted throughout the manuscript accordingly.

      (2) I would not recommend focusing the introduction on the problem of qualia; this is a much more general and complex question than the one addressed in the study; the space of the introduction may be better used to present the actual object of study, giving a better picture of the synesthetic phenomenon and of previous work aimed at characterising it (behavioural, including PA scores and consistency measures, and neuroimaging). It is important to discuss how the pupillometric approach differs from the previously adopted neuroimaging techniques and what it can add to those.

      We agree that qualia is a very general and complex question. However, we respectfully disagree that this complex question is not the object of the study. What is remarkable about synesthesia is not the presence of an additional perceptual association per se, but the presence of a specific perceptual experience. As illustration, think of a test where an unconscious color association to the word 'banana' was tested. While a generic 'yellow' could semantically be linked and would likely be obtained in the (e.g. priming) experimental results, a follow-up question of picking on a color wheel the exact shade of yellow to this association, or describing the perceptual sensation of the color, would be non-sensical to the participants.

      This sharply contrasts with the current study: synesthetes, but not non-synesthetes, indicate a perceptual sensation of additional colors, and the sensory properties of this percept (experienced brightness) indeed affect the objective reflection of this sensation (pupil size) in synesthetes but not in non-synesthetes. In our view, the presence of additional qualia is key in understanding what sets synesthetic apart from non-synesthetic associations, including so-called cross-modal correspondences (unconscious consistent associations across modalities, common to us all). We even believe that the reported qualia are what make synesthesia so interesting in the first place. We now explain this link to qualia more clearly in the introduction.

      "The most remarkable aspect of synesthesia is the subjective perceptual phenomenology of the induced colors, setting these sensations apart from color memory, thought, or amodal association. The contrast between synesthetes and non-synesthetes can thus offer an interesting doorway into examining qualia, the subjective perceptual phenomenology or first person (what's-it-like) perspective."

      We also improved the explanation of the synesthetic phenomenon, including a more detailed characterisation of behavioural measures (including consistency scores) and added neuroimaging studies. These changes have been incorporated into the text in response to previous comments (point 1- reviewer 1).

      Please note that we have chosen not to include more detailed discussion of PA scores. Our results show a trend but do not allow for a conclusive interpretation on PA scores, and we feel that placing greater emphasis on this topic might therefore be confusing or even misleading. Still, it would be a very interesting topic for follow-up research to examine how alterations in characteristics of the synesthetic experience influence pupil responses.

      The different synesthesia types all share the defining characteristics of an additional conscious and consistent experience. Synesthetes can verbally report their additional experience, and synesthetic sensations can be measured in behavioral paradigms such as the ’synesthetic Stroop’ effect, or brain activation patterns in sensory cortex [15]. Furthermore, test-retest paradigms show how synesthetic, but not non-synesthetic associations are highly specific and consistent [16-18]. Thus, over the past decades, research has established synesthesia as a ’real’ condition that can reliably be identified using behavior, neurophysiology, and neuroimaging [11, 13, 15–21]. The most remarkable aspect of synesthesia is the subjective perceptual phenomenology of the induced additional sensation, i.e., color in grapheme-color synesthesia. This sets synesthetic sensations apart from (color) memory, thought, or amodal association. Synesthesia can thus offer an interesting doorway into examining qualia, the subjective perceptual phenomenology or first person (what’s-it-like) perspective.

      We now discuss the pupillometric approach as it differs from the previously adopted neuroimaging techniques as follows:

      “Compared to neuroimaging studies [12,15,51], pupillometry may offer a more direct window into synesthetic phenomenology, as the directionality between pupil light reflex and perceived brightness is straightforward. Finally, improved understanding of the underlying processes can be obtained by contrasting responses to perceived versus actual (physical) brightness, given that the pupil light reflex is a well-characterised reflex arc involving few inferential steps.”

      This adds to the explanation that was already present on how the current approach differs from previous techniques, and what it can add to those techniques:

      "Instead, current paradigms capturing synesthesia employ objective measures, but fail to capture its phenomenology [16, 17, 21, 23]."

      (3) There are a few typos and word repetitions.

      Many thanks – we identified typos and repetitions after another set of careful reads and hope to have eradicated them completely now.

      Reviewer #2 (Recommendations for the authors):

      I am overall very supportive of this work, but addressing the following points may enrich it further:

      (1) Paragraph 2.2.1. Here, models do not seem to compare synesthetes versus controls but rather assess the effects of interest separately in the two groups. The fact that experimental effects are significant in synesthetes, but not in controls, does not tell us much about differences between groups. Controls (e.g., Figure 3) do show a similar trend, albeit clearly smaller. There is one passage in which this issue appears to be tackled (page 10): "Critically, in an LME ran on synesthetes and controls and using only graphemes and the interaction of group and lightness as predictors, we found lightness to predict pupil size in synesthetes (t = -2.754, p = 0.006), but not controls (t = -1.134, p = 0.257)." But I am not sure that the reported statistics belong to the interaction - they seem to refer to the lightness effect within each group, not the difference.

      This is an important point. Power for between-group comparisons is inherently limited with n = 16 per group (while still feasible for overall responses, things become trickier when fewer trials remain). A simple model of pupil ~ grapheme + group * lightness_scaled + (1 | participant) shows no significant interaction (despite one group showing the effect significantly and the other not). The additional negative effect for group is in line with the effort-related effect reported later in the manuscript. Where does this leave us? Based on the lightness responses alone, the group difference can be characterized as a quantitative distinction, but the degree to which it is also a qualitative distinction cannot clearly be determined from the current data. We revised the manuscript to make sure that such an interaction is not implied and to point out that the interaction was not significant.

      The sensory nature of synesthetic color is supported by within-synesthete analyses, where coupling strength parametrically modulates the lightness-pupil relationship in a theoretically predicted manner. Importantly, the effort-related findings provide a complementary and statistically robust group comparison: synesthetes and controls performing the identical color-reporting task showed significantly different pupil dilation rates, directly demonstrating that the two groups differ in how they access color information. Together, these two independent pupillometric signatures, one tracking perceptual quality, one tracking effort, converge on the same conclusion and mutually reinforce the interpretation that synesthetic color constitutes genuine sensory phenomenology.
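
      The general statistical point here, that an effect being significant in one group and not in the other does not imply a significant group difference, can be sketched with made-up numbers. The slope estimates and standard errors below are hypothetical (they are not values from the manuscript), and the sketch uses a normal approximation with independent group estimates rather than the joint LME described above.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical lightness slopes and standard errors for each group.
b_syn, se_syn = -0.030, 0.011  # synesthetes
b_ctl, se_ctl = -0.012, 0.011  # controls

# Test each slope against zero, separately per group.
p_syn = two_sided_p(b_syn / se_syn)
p_ctl = two_sided_p(b_ctl / se_ctl)

# Test the difference between slopes (the analogue of the interaction term).
diff = b_syn - b_ctl
se_diff = math.sqrt(se_syn ** 2 + se_ctl ** 2)
p_diff = two_sided_p(diff / se_diff)

print(f"synesthetes: p = {p_syn:.3f}")
print(f"controls:    p = {p_ctl:.3f}")
print(f"difference:  p = {p_diff:.3f}")
```

      With these illustrative inputs, the synesthete slope passes the conventional 0.05 threshold and the control slope does not, yet the difference between the two slopes does not reach significance either, because its standard error combines the uncertainty of both estimates.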

      Author response image 3.

      We now make this more explicit in the manuscript as follows:

      “We found significant modulations of pupil size by the lightness of the grapheme's synesthetic color - sustained and in the to-be-expected time window. Specifically, the pupil constricted more for brighter reported colors, and dilated more for darker reported colors, as predicted (Average pupil size 800-4000ms, t = -3.601, p < 0.001). In an LME run for synesthetes and controls and using only graphemes and lightness as predictors, we found lightness to predict pupil size in synesthetes (t = 2.844, p = 0.004), but not controls (t = 0.606, p = 0.544). However, when taking group as interacting factor in a joint LME, there was no interaction of lightness and group (t = -0.949, p = 0.342).”

      and

      “For controls a separate model was run, now without the PA score as predictor (not assessed for controls). Neither lightness (t = -0.815, p = 0.415), coupling strength (t = 0.438, p = 0.661), nor their interaction gained significance (t = -1.058, p = 0.290; all for average pupil size between 800 ms and 4000 ms). Critically, we also ran an LME with the three-way interaction of coupling strength, group, and lightness (Wilkinson notation: pupil = grapheme + group + lightness * group + coupling strength * lightness * group + (1 | participant)). This analysis revealed a significant three-way interaction between lightness, coupling strength, and group (F = 3.86, p = .021), indicating that the lightness × coupling strength effect on pupil size was not equivalent across groups. Decomposing this interaction by group, the lightness × coupling strength slope was significant in synesthetes (t = 2.59, p = .010) but not in controls (t = -1.01, p = .311), suggesting that reported lightness and its coupling strength were more consistently related to pupil size in synesthetes than in controls. Note, however, that this decomposition does not directly test whether the two slopes significantly differ from each other. Lastly, pupil size was marginally larger in controls than in synesthetes (t = 1.94, p = .062; see later sections for more in-depth analyses)”

      (2) The authors choose to analyze pupil size in arbitrary eye tracker units. This is fine, although I would recommend assessing and reporting whether the average pupil size (e.g., during the baseline) is roughly comparable between groups. The size of the effects may be difficult to compare between groups in the presence of very different baseline pupil size.

      Please see Author response image 4 for Baseline pupil sizes per group in millimeters. There were no differences between groups.

      Author response image 4.

      F(2, 45) = 0.707, p = 0.499 (one-way ANOVA).

      We now write:

      “Baseline pupil sizes did not differ between groups (F(2, 45) = 0.707, p = 0.499).”

      We agree with the reviewer that millimeters are a more intuitive measure and have updated all figures throughout the manuscript and supplementary materials accordingly. We also briefly note in the signal processing section that this conversion was applied.

      “Pupillary data were transformed from arbitrary EyeLink units to millimeters using a conversion factor obtained with an artificial eye (see Hayes & Petrov, 2016).”

      Hayes, T. R., & Petrov, A. A. (2016). Mapping and correcting the influence of gaze position on pupil size measurements. Behavior research methods, 48(2), 510-527.

      (3) If I understand correctly, the main task counted 120 trials overall (12 per digit). It seems, however, that only 3 and 4 participants remained with at least 50 trials (or 25 per median split by lightness) after preprocessing. This appears to be quite a massive data loss: is there a reason behind it? Please also clarify: the overall percentage of discarded trials; whether the median split by lightness was computed on all responses or only on those of the remaining, valid trials.

      This is an important point for clarification indeed. The exclusion of participants in Figure 3 applies only to that particular visualization, not to the statistical analyses. The linear mixed effects models (LMEs) used all available valid trials from all participants, with no participant-level exclusions. The figure-specific threshold (≥25 trials per median-split bin) was applied purely for display clarity, as plotting participants with very few trials per bin would produce unreliable/noisy and thus visually misleading traces (as we note in the figure caption and point readers to Supplementary Figure 1, which shows the same visualization without any exclusions).

      Since the paradigm required participants to repeat discarded trials until 120 valid trials were collected, all participants thus contributed exactly 120 valid trials to the analyses. There was therefore no data loss at the analysis level for the LME that is central to the claims of the manuscript (albeit more complex to grasp than the t-tests between bins).

      Why were there sometimes so few trials per brightness bin?

      First, participants differed in how dark or bright their (synesthetic or forced-report) colors were overall, so differing proportions of their reports fell above or below the 0.5 cutoff. This cutoff represented the sample well overall, but not necessarily every single participant. Note that the median split was performed across all color reports rather than per individual, to allow an apples-to-apples comparison.

      Second, participants often reported colors that differed in Hue and Saturation, but not Lightness. This is in line with synesthetes picking certain colors more often than others, as compared with non-synesthetes (Rouw & Root, 2019; Ward et al., 2025).
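
      The first point can be illustrated with a minimal sketch. The participant labels and lightness values below are hypothetical, and the function name is ours; the only thing taken from the response is that a single cutoff is computed across all pooled reports rather than per participant.

```python
from statistics import median

# Hypothetical lightness reports (0 = black, 1 = white) for two participants.
reports = {
    "p01": [0.2, 0.3, 0.4, 0.45, 0.7],   # mostly dark colors
    "p02": [0.55, 0.6, 0.8, 0.9, 0.35],  # mostly bright colors
}


def pooled_split(reports):
    """Split every trial at the median of all reports pooled across participants."""
    pooled = [v for values in reports.values() for v in values]
    cutoff = median(pooled)
    counts = {}
    for pid, values in reports.items():
        dark = sum(v <= cutoff for v in values)
        counts[pid] = {"dark": dark, "bright": len(values) - dark}
    return cutoff, counts


cutoff, counts = pooled_split(reports)
print(cutoff, counts)
```

      With a pooled cutoff the bins are balanced across the sample but can be very uneven within a participant, which is why a per-bin trial threshold can exclude participants from a plot even when no trials are lost.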

      We now include a new Supplementary Figure that visualizes responses on the Hue and Saturation dimensions of HSL space for both synesthetes and controls; fully saturated reports appear on the outer edge. We refer to the supplementary figure in the caption of Figure 2 as follows:

      “See Supplementary Figure 1 for color reports on the hue and saturation axes.”

      Rouw, R., & Root, N. B. (2019). Distinct colours in the ‘synaesthetic colour palette’. Philosophical Transactions of the Royal Society B: Biological Sciences, 374(1787).

      Ward, J., Maciel, S., Rouw, R., Simner, J., & Root, N. (2025). Synaesthesia is linked to differences in music preference and musical sophistication and a distinctive pattern of sound-color associations. Psychology of Music, 53(3), 453-473.

      Minor points:

      (1) "Building on this evidence, we hypothesized that the cross modal color phenomenology in synesthesia can, if truly sensory in nature, could likewise be (...)" -> may need rephrasing (can/could).

      Many thanks, fixed.

      (2) Caption of Figure 1: "Block 2 (synesthetes only): a colored disk and gray central patch, matching the average indicated color per digit, and the number and luminance of pixels of said digit were presented to assess externally triggered light responses." -> I find this sentence a bit hard to follow; perhaps consider rephrasing it.

      Agreed, we rephrased to:

      Block 2 (synesthetes only): a colored disk was presented, colored according to the synesthete's average indicated color for that digit. At its center sat a gray patch matching the luminance and pixel area of the original digit from Block 1, together allowing assessment of externally triggered light responses.

      (3) Figure 2 b: Consider truncating the y-axis to 1 if that improves the visualization.

      We adjusted the axis accordingly and added a bit more detail in the caption for the interpretation of the measure.

      (4) Caption of Figure 3 points to "see Supplementary Figure 1", but it should probably be SF2.

      Many thanks for spotting, all references to supplementary figures have been checked and are corrected now.

      Elvio Blini

      Reviewer #3 (Recommendations for the authors):

      (1) As a minor comment, there are some terms that felt overused in the manuscript. For example, the words "extraordinary" and "exceptional" were used multiple times throughout. I believe I understand the authors to mean them in their descriptive sense (i.e., outside the realm of typical experience), but in context, those words make it seem like they are touting their own experiment as "exceptional" or "extraordinary," which I don't believe was their intention.

      We agree. Throughout the manuscript, we removed words such as exceptional and extraordinary wherever they do not directly refer to the sensation (which is indeed how we intended to use them). We hope that this removes unnecessary and confusing hyperbole.

      (2) It seemed counterintuitive to me that the color consistency score would be reverse-coded. In this case, the scores actually seem to indicate inconsistency, rather than consistency. Perhaps the raw scores can be inverted for a more intuitive interpretation that aligns with the terminology. I understand that they were following a previous publication in their method (Rothen et al., 2013).

      This manner of coding is indeed counter-intuitive. However, there are both logical and practical reasons for this approach. Importantly, it is the standard way of reporting color consistency in synesthesia research (Carmichael et al., 2015; Eagleman et al., 2007; Root et al., 2025; Rothen et al., 2013). The calculation is based on a simple logic: a higher number reflects a larger distance in color space. An additional advantage is the clear and intuitive zero reference: a score of zero implies choosing the exact same color. Finally, it intuitively reflects the distinction between synesthetes and non-synesthetes: there is by definition little variation across synesthetes (visualized at the bottom of the graph), then a 'cut-off line' (if consistency is used as a diagnostic tool), and then the height of the range shows how large the range in consistency is in that particular sample of non-synesthetes. In a way we therefore inherit a confusing definition/standard, but changing it would lead to new confusion instead. We now specifically clarify this in the caption as follows:

      “Note that higher consistency is reflected in lower color distance, hence lower values [17].”
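
      The logic of a distance-based score can be sketched as follows. This is an illustration only: the mean pairwise Euclidean distance over three-component color coordinates is an assumption made here, and the color space, number of repetitions, and normalization used in the cited test batteries are not reproduced.

```python
from itertools import combinations
from math import dist


def consistency_score(choices):
    """Mean pairwise Euclidean distance between repeated color choices.

    Lower values mean the same color was picked more consistently;
    0 means every repetition was the exact same color.
    """
    pairs = list(combinations(choices, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical color coordinates for three repetitions of one grapheme.
same = [(0.5, 0.2, 0.1)] * 3
varied = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]

print(consistency_score(same))
print(consistency_score(varied))
```

      Inverting such a score would remove the natural zero reference, since "identical choices" would then map to an arbitrary maximum.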

      Carmichael, D.A., Down, M.P., Shillcock, R.C., Eagleman, D.M., Simner, J., 2015. Validating a standardised test battery for synesthesia: does the synesthesia battery reliably detect synesthesia? Conscious. Cogn. 33, 375–385

      Eagleman, D.M., Kagan, A.D., Nelson, S.S., Sagaram, D., Sarma, A.K., 2007. A standardized test battery for the study of synesthesia. J. Neurosci. Methods 159 (1), 139–145.

      Root, N., Chkhaidze, A., Melero, H., Sidoroff-Dorso, A., Volberg, G., Zhang, Y., & Rouw, R. (2025). How “diagnostic” criteria interact to shape synesthetic behavior: The role of self-report and test–retest consistency in synesthesia research. Consciousness and Cognition, 129, 103819.

      Rothen, N., Seth, A.K., Witzel, C., Ward, J., 2013. Diagnosing synaesthesia with online colour pickers: maximising sensitivity and specificity. J. Neurosci. Methods 215 (1), 156–160.

    1. eLife Assessment

      This study presents a large, systematically curated catalog of non-canonical open reading frames (ncORFs) in human and mouse through the reanalysis of nearly 400 Ribo-seq datasets using a standardized pipeline; the resulting atlas consolidates ncORF annotations across tissues and provides a valuable resource for investigating non-canonical translation and ORF emergence. The main conclusions are supported by consistent data processing and multiple computational measures of translation and conservation. While the pipeline is transparent and technically robust, some analytical criteria and dataset limitations could be described more explicitly, and several downstream conclusions would benefit from more cautious interpretation. Some evolutionary inferences are primarily correlative; dataset heterogeneity, uneven tissue representation, and limited experimental validation also constrain the strength of a subset of the findings. Overall, the evidence is solid, and the resource is likely to be broadly beneficial to the community.

    2. Reviewer #1 (Public review):

      This work compiles a comprehensive atlas of ncORFs across mammalian tissues and cell types, derived from reanalysis of ~400 public ribosome profiling datasets. The authors then evaluate cross-species conservation and functional signatures, proposing that evolutionarily ancient ncORFs tend to have higher translation potential, stronger expression, and closer relationships with canonical coding sequences.

      Strengths:

      In general, the study provides a large-scale and timely resource of annotated ncORFs, which could be broadly useful for the community. The authors collected ~400 public ribosome profiling datasets for annotations of ncORFs, which, to my best knowledge, is the largest collection of data for such purpose. The catalog could facilitate future investigations into ncORF biology and broaden understanding of the coding potential of the "non-coding" genome.

      Weaknesses:

      Based on the ncORF catalog, some of the analyses were not properly done. Some of the results are descriptive.

      (1) Bias and representations of data source. Public ribo-seq datasets are unevenly distributed across tissues and cell lines, raising concerns about heterogeneity and underrepresentation of certain contexts. This may limit the generalizability of the catalog.

      (2) The discussion on modular domains of ncORFs is unclear, and the claim that they may originate via TE-related mechanisms is not well supported. Stronger evidence or clearer reasoning is needed.

      (3) The conservation comparisons are not fully convincing. Figure S7 shows only mild differences between ncORFs and CDS, and statistical significance is not clearly demonstrated. Comparisons with other non-coding RNAs should be added, and overlapping sequences between ncORFs and CDS should be excluded to avoid bias.

      (4) Figure 3 indicates that some ncORFs are subject to evolutionary constraints. This is not surprising. The authors should provide further analyses on more detailed features of these "conserved" ncORFs vs. the "non-conserved" ones. Some pretty informative works have been done in Drosophila, worms, mouse, and human. Figure 3 suggests some ncORFs are under evolutionary constraint, but this is not unexpected. More granular analyses contrasting "conserved" versus "non-conserved" ncORFs would be informative. In fact, small ORFs, especially uORFs, have been extensively studied for their functions and cross-species conservation. The authors should explicitly show what is new here in their analyses.

      (5) Translation levels are reported using RPF counts. However, translation efficiency (normalized by RNA expression) is a more appropriate measure to account for expression heterogeneity.

      (6) The correlation analyses between ncORF translation levels and PhyloCSF are confusing and largely descriptive. These sections need sharper framing and clearer conclusions.

      (7) Public ribo-seq datasets, generated by different research labs, are known for their strong batch effects. Representations of tissues and cells are also very unbalanced. Therefore, the co-translation analysis between ncORFs and canonical CDS is not well controlled. This should be done by referring to a recent large-scale ribo-seq meta-analysis (Nat Biotechnol. 2025. doi: 10.1038/s41587-025-02718-5).

      Comments on revisions:

      The authors have made efforts to address most of the previous concerns, and several points have been clarified or improved in the revision. However, in a number of cases, the responses rely more on acknowledgment and reframing rather than substantive analytical strengthening. Overall, the manuscript is improved, particularly in terms of clarity, transparency, and positioning of claims. I support its publication and look forward to seeing how the field engages with and discusses these claims.

    3. Reviewer #2 (Public review):

      Summary:

      Chang et al. attempted to analyze a large number of ribo-seq datasets through a standardized pipeline, identifying novel non-canonical ORFs and elucidating their evolutionary and expression characteristics.

      Strengths:

      (1) The datasets analyzed by the authors are sufficiently comprehensive, and the use of standardized pipelines ensures excellent analytical consistency.

      (2) Their analyses of ORF evolution and co-expression further deepen our understanding of these ORFs.

      Weaknesses:

      (1) The authors primarily conducted analyses through bioinformatics, lacking sufficient wet-lab experimental evidence.

      (2) Some analytical methods and standards were not clearly presented in the manuscript.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      This work compiles a comprehensive atlas of ncORFs across mammalian tissues and cell types, derived from reanalysis of ~400 public ribosome profiling datasets. The authors then evaluate cross-species conservation and functional signatures, proposing that evolutionarily ancient ncORFs tend to have higher translation potential, stronger expression, and closer relationships with canonical coding sequences.

      Strengths:

      In general, the study provides a large-scale and timely resource of annotated ncORFs, which could be broadly useful for the community. The authors collected ~400 public ribosome profiling datasets for annotations of ncORFs, which, to my best knowledge, is the largest collection of data for such a purpose. The catalog could facilitate future investigations into ncORF biology and broaden understanding of the coding potential of the "non-coding" genome.

      We thank the reviewer for the positive evaluation of our manuscript and for recognizing the significance of our contribution.

      Weaknesses:

      Based on the ncORF catalog, some of the analyses were not properly done. Some of the results are descriptive.

      (1) Bias and representations of the data source. Public ribo-seq datasets are unevenly distributed across tissues and cell lines, raising concerns about heterogeneity and underrepresentation of certain contexts. This may limit the generalizability of the catalog.

      We agree with the reviewer that the uneven distribution of public Ribo-seq datasets across tissues can inevitably introduce bias in the ncORF composition of our catalog. This bias is likely more pronounced in humans due to the narrower tissue coverage. We have addressed this point in the Discussion section of the revised manuscript.

      (2) The discussion on modular domains of ncORFs is unclear, and the claim that they may originate via TE-related mechanisms is not well supported. Stronger evidence or clearer reasoning is needed.

      We thank the reviewer for highlighting this point. We have revised the manuscript to more clearly explain the rationale behind our analysis of ncORF modular domains and have adopted more cautious language regarding their potential transposable element–related origins, limiting interpretations to what is directly supported by the data.

      (3) The conservation comparisons are not fully convincing. Figure S7 shows only mild differences between ncORFs and CDS, and statistical significance is not clearly demonstrated. Comparisons with other non-coding RNAs should be added, and overlapping sequences between ncORFs and CDS should be excluded to avoid bias.

      We thank the reviewer for this comment and apologize for the lack of clarity in the original figure. Both CDSs and ncORFs show significant deviation from zero Gnocchi scores (two-sided Wilcoxon signed-rank tests), which is now stated explicitly in the revised legend and text. CDS-overlapping ncORFs were already excluded in the original analysis; this has been clarified to avoid confusion.

      As suggested, we have added lncRNAs for comparison. ncORFs display modestly higher Gnocchi scores than lncRNAs, and this difference persists when restricting the analysis to lncRNA-derived ncORFs and their corresponding full-length lncRNAs (see revised Fig. S7). These additions strengthen the conservation comparison while controlling for transcript context.

      (4) Figure 3 indicates that some ncORFs are subject to evolutionary constraints. This is not surprising. The authors should provide further analyses on more detailed features of these "conserved" ncORFs vs. the "non-conserved" ones. Some pretty informative works have been done in Drosophila, worms, mice, and humans. Figure 3 suggests some ncORFs are under evolutionary constraint, but this is not unexpected. More granular analyses contrasting "conserved" versus "non-conserved" ncORFs would be informative. In fact, small ORFs, especially uORFs, have been extensively studied for their functions and cross-species conservation. The authors should explicitly show what is new here in their analyses.

      We thank the reviewer for this insightful comment. We agree that cross-species conservation of ncORFs (particularly uORFs) has been extensively investigated in prior studies, including our own.

      However, most prior analyses have focused on conservation of start codons or overall ORF integrity, which does not distinguish selection acting on translational activity from selection acting on the encoded peptide sequence itself. In contrast, our analysis leverages codon-level periodic PhyloP signals across the full ORF. The observed three-nucleotide periodicity is consistent with selective constraint at the amino acid level, rather than merely preservation of initiation sites or translational potential. Furthermore, our newly developed branch-length statistic uncovers lineage-restricted conservation patterns among ncORFs, enabling resolution of evolutionary dynamics not captured by conventional conservation metrics.

      Thus, while the existence of conserved ncORFs is not unexpected, the conceptual advance of our study lies in demonstrating that a subset exhibits coding-like evolutionary constraint consistent with selection on their peptide products, as well as revealing lineage-specific conservation patterns. We have clarified this distinction in the revised Discussion.
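
      The idea of a codon-level periodic signal can be sketched as below. The scores are hypothetical PhyloP-like values, and the function is an illustrative assumption rather than the authors' statistic; it only averages per-base conservation by codon position, where lower scores at the third position than at the first two are the pattern typically expected under amino-acid-level constraint.

```python
def mean_by_codon_position(scores):
    """Average per-base conservation scores at codon positions 1, 2 and 3."""
    if len(scores) % 3 != 0:
        raise ValueError("ORF length must be a multiple of three")
    n_codons = len(scores) // 3
    # scores[i::3] picks every third base starting at codon position i + 1.
    return [sum(scores[i::3]) / n_codons for i in range(3)]


# Hypothetical per-base scores for a three-codon ORF, in reading-frame order.
scores = [2.0, 2.5, 0.5, 1.5, 2.0, 0.0, 2.5, 1.5, 1.0]
means = mean_by_codon_position(scores)
print(means)
```

      A flat profile across the three positions would instead be compatible with constraint that does not depend on the reading frame, such as preservation of a regulatory element.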

      (5) Translation levels are reported using RPF counts. However, translation efficiency (normalized by RNA expression) is a more appropriate measure to account for expression heterogeneity.

      We agree that translation efficiency (TE), which normalizes ribosome footprint counts by RNA abundance, is in principle an appropriate metric. We initially calculated TE and compared ncORFs with CDSs. However, we found that TE estimates for short ncORFs were substantially inflated by RPF enrichment near start and stop codons, leading to unstable and potentially misleading values.

      For CDSs, this bias is commonly addressed by excluding the first and last 10 to 20 codons when quantifying RPF density. This strategy is not feasible for ncORFs because of their short length. We therefore used RPF counts in the final analysis, applying stringent positional filtering. Only RPFs whose P sites fall within the ORF body, excluding start and stop codons, were counted. RPFs overlapping the ORF but with P sites outside the annotated frame, likely derived from adjacent ORFs or initiation or termination pausing, were excluded.

      TE and RPF counts both measure translation but capture different aspects. TE reflects ribosome density relative to transcript abundance, whereas RPF counts quantify overall ribosome engagement. Given the short lengths of ncORFs, count-based quantification provides a more robust and conservative estimate of their translational activity.
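
      The positional filter described above can be sketched as follows. The coordinates, the fixed 12-nt P-site offset, and the function name are assumptions for illustration; real pipelines usually estimate the offset per read length, and reading-frame checks are not shown.

```python
def count_body_rpfs(read_starts, orf_start, orf_end, psite_offset=12):
    """Count reads whose P site lies in the ORF body.

    orf_start: 0-based position of the first base of the start codon.
    orf_end:   0-based position just past the last base of the stop codon.
    The body excludes the start and stop codons themselves.
    """
    body_start = orf_start + 3
    body_end = orf_end - 3
    psites = (start + psite_offset for start in read_starts)
    return sum(body_start <= p < body_end for p in psites)


# Hypothetical 10-codon ORF (start and stop codons included) and read 5' ends.
read_starts = [85, 88, 91, 100, 110, 115, 120]
n_body = count_body_rpfs(read_starts, orf_start=100, orf_end=130)
print(n_body)
```

      Reads that overlap the ORF but place their P site on the start codon, on the stop codon, or outside the ORF are dropped, which removes the initiation and termination peaks that inflate density estimates for short ORFs.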

      (6) The correlation analyses between ncORF translation levels and PhyloCSF are confusing and largely descriptive. These sections need sharper framing and clearer conclusions.

      We thank the reviewer for this comment. We agree that the original presentation lacked clear framing. The relationship between PhyloCSF scores and mean ncORF translation levels across tissues is influenced by both evolutionary age and tissue specificity. Older ncORFs with higher coding potential tend to exhibit stronger tissue-restricted expression. As a result, their mean translation levels across all tissues appear lower, not because they are weakly translated, but because their translation is concentrated in specific tissues. This point is addressed in the revised manuscript.

      (7) Public ribo-seq datasets, generated by different research labs, are known for their strong batch effects. Representations of tissues and cells are also very unbalanced. Therefore, the co-translation analysis between ncORFs and canonical CDS is not well controlled. This should be done by referring to a recent large-scale ribo-seq meta-analysis (Nat Biotechnol. 2025. doi: 10.1038/s41587-025-02718-5).

      We thank the reviewer for highlighting this important study and for raising concerns regarding batch effects and tissue imbalance in public Ribo-seq datasets. We are aware that public Ribo-seq data generated by different laboratories are subject to substantial batch effects. During the ncORF annotation phase, we applied stringent quality-control criteria to minimize technical variability. For the co-translation analysis, inclusion criteria were relaxed to increase tissue and cell-type coverage. To partially mitigate representation bias, libraries derived from the same tissue or cell type were merged when quantifying ORF translation levels, thereby reducing overrepresentation from heavily sampled contexts.

      Nevertheless, we acknowledge that these measures cannot completely eliminate batch effects or imbalance inherent to public datasets. We agree that co-translation analysis would benefit from uniformly processed, high-quality datasets generated under standardized protocols with balanced tissue representation, representing a valuable direction for future research.
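
      The merging step can be sketched as below, assuming that "merged" means summing raw per-ORF counts over libraries that share a tissue or cell-type label; the labels and counts are hypothetical.

```python
from collections import defaultdict


def merge_by_tissue(libraries):
    """Sum per-ORF read counts over libraries that share a tissue label."""
    merged = defaultdict(lambda: defaultdict(int))
    for tissue, counts in libraries:
        for orf, n in counts.items():
            merged[tissue][orf] += n
    return {tissue: dict(counts) for tissue, counts in merged.items()}


# Hypothetical libraries: two from liver, one from brain.
libraries = [
    ("liver", {"orfA": 10, "orfB": 2}),
    ("liver", {"orfA": 5}),
    ("brain", {"orfA": 1, "orfB": 7}),
]
merged = merge_by_tissue(libraries)
print(merged)
```

      After merging, each tissue contributes one profile to a correlation analysis regardless of how many libraries were sequenced for it, although lab-specific batch effects within a tissue remain.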

      Reviewer #2 (Public review):

      Summary:

      Chang et al. attempted to analyze a large number of ribo-seq datasets through a standardized pipeline, identifying novel non-canonical ORFs and elucidating their evolutionary and expression characteristics.

      Strengths:

      (1) The datasets analyzed by the authors are sufficiently comprehensive, and the use of standardized pipelines ensures excellent analytical consistency.

      (2) Their analyses of ORF evolution and co-expression further deepen our understanding of these ORFs.

      We thank the reviewer for the positive evaluation of our manuscript. It is encouraging to know that the analytical framework was found to be sound and appropriate.

      Weaknesses:

      (1) The authors primarily conducted analyses through bioinformatics, lacking sufficient wet-lab experimental evidence.

      We thank the reviewer for this comment and acknowledge this limitation. We agree that functional validation through wet-lab experiments would provide important mechanistic insight into individual ncORFs. However, this study was designed as a systematic, genome-wide computational analysis to characterize translated ncORFs across species and tissues. Our objective was to define global patterns of translation, conservation, and structural features using large-scale datasets. Given the breadth and scale of these analyses, experimental validation of specific ncORFs falls beyond the scope of the current study. We have clarified this point in the Discussion and noted that our results provide a framework for future targeted experimental investigation.

      (2) Regarding the evolution of non-canonical ORFs, a considerable amount of prior work already exists. The authors need to further clarify what new insights and discoveries they have made based on the analysis of such a large dataset.

      We thank the reviewer for this suggestion. Similar concerns were also raised by Reviewer #1. In response, we have revised the Discussion to more clearly delineate the conceptual advances enabled by our large-scale dataset.

      Recommendations for the authors:

      Reviewing Editor Comments:

      Several aspects of the downstream analyses would benefit from additional refinement. The heterogeneity and tissue imbalance inherent in public Ribo-seq datasets introduce potential biases in ncORF detection and inferences about co-translation. Given the breadth of the dataset, it would also be informative to quantify how consistently the newly identified ncORFs are detected across samples, distinguishing those observed broadly across tissues, those enriched in specific contexts, and those detected in only a few datasets. Such stratification would help differentiate reproducibly translated ORFs from candidates requiring further validation.

      We thank the editor for the helpful comments. We agree that heterogeneity and tissue imbalance in public Ribo-seq datasets can influence ncORF detection and downstream interpretations. We have added discussion of this limitation in the revised manuscript.

      Detection of ncORF translation depends not only on biological activity but also on sequencing depth and data quality. Although all ncORFs reported here were reproducibly identified by multiple methods across independent libraries, we agree that those detected in a larger number of datasets represent stronger candidates for functional validation. Accordingly, we now report the number of methods and libraries in which each ncORF was detected in the final catalog (Supplementary Table 3). Overall, 22.3–26.3% of ncORFs were detected in more than 10 libraries, whereas more than half were observed in only two to five libraries (Fig. S1B), enabling clearer stratification of broadly translated versus more context-specific candidates.
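
      A stratification of this kind can be sketched as below. The ORF names and library counts are hypothetical, and the intermediate "6-10" tier is an assumption added to fill the gap between the two ranges mentioned in the response.

```python
from collections import Counter


def detection_tier(n_libraries):
    """Assign an ncORF to a tier by the number of libraries detecting it."""
    if n_libraries < 2:
        raise ValueError("catalog entries are assumed to be seen in >= 2 libraries")
    if n_libraries <= 5:
        return "2-5"
    if n_libraries <= 10:
        return "6-10"
    return ">10"


def tier_fractions(library_counts):
    """Fraction of ncORFs falling into each detection tier."""
    tiers = Counter(detection_tier(n) for n in library_counts.values())
    total = sum(tiers.values())
    return {tier: count / total for tier, count in tiers.items()}


# Hypothetical number of supporting libraries per ncORF.
library_counts = {"orfA": 2, "orfB": 4, "orfC": 7, "orfD": 25}
fractions = tier_fractions(library_counts)
print(fractions)
```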

      Some evolutionary and functional interpretations are largely descriptive or consistent with established findings for small ORFs, and the authors should more clearly articulate what is novel in their analyses. The criteria separating "young," "old," and "ancient" ORFs require clearer definition, and conservation analyses would be strengthened by improved statistical rigor and explicit exclusion of regions overlapping annotated coding sequences. Evidence for modular domain features or transposable element-related origins is limited and warrants either stronger support or more cautious framing. Proteomics validation is currently minimal and could be substantially reinforced using existing public MS resources.

      We thank the reviewer for these constructive comments. In the revised manuscript, we more clearly delineate the novel insights derived from our evolutionary analyses of ncORFs, distinguishing them from established findings on small ORFs.

      We have clarified the criteria used to classify ORFs by evolutionary age in Figure 6E and refined the terminology describing “young,” “old,” and “ancient” categories to ensure precise definition. The conservation analyses have been strengthened through more rigorous statistical treatment and by explicitly excluding regions overlapping annotated coding sequences.

      With respect to modular domain features and potential transposable element–related origins, we have adopted more cautious language and limited our interpretations to what is directly supported by the data. Finally, we acknowledge that current proteomic validation remains limited; we have clarified this point in the manuscript and outline in the Discussion the potential for future integration of large-scale public mass spectrometry datasets.

      The authors additionally report an interesting observation that many ncORFs on mRNA co-translate with the main CDS of the same gene. Because canonical models often posit that uORF translation suppresses downstream CDS translation, further analysis would be valuable. In particular, it would be useful to determine whether patterns of co-translation differ among ORF types or evolutionary categories and to discuss possible regulatory mechanisms underlying these relationships.

      We thank the editor for this thoughtful comment. As noted in our response to Reviewer #2, uORF–CDS co-translation does not contradict the canonical model in which uORFs repress downstream CDS translation. Co-translation reflects concurrent ribosome occupancy, whereas repression concerns the fraction of initiating ribosomes that ultimately reach and translate the CDS. Following the editor’s suggestion, we further examined whether co-translation patterns differ across ORF types or evolutionary categories. We found that ncORFs co-translating with their corresponding main CDSs are predominantly uORFs. However, these uORFs do not show statistically significant differences in conservation metrics or evolutionary age compared with other non-overlapping uORFs. Thus, we did not detect clear subtype- or age-specific distinctions among co-translating ncORFs. We have clarified these analyses in the revised manuscript.

      Addressing these points would enhance the precision, interpretability, and robustness of the study's conclusions.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors developed and refined a standardized pipeline to analyze nearly 400 ribo-seq datasets, identifying over 10,000 novel non-canonical ORFs in both human and mouse samples. Given the scale of this analysis, it is intriguing to consider how many of the newly identified non-canonical ORFs are consistently detected across multiple sample types (conservatively expressed ORFs), how many are restricted to specific tissues (tissue-specific ORFs), and how many were detected in only a single or very few samples (ORFs requiring further validation). Providing these data could offer new insights into understanding ORF translation.

      Thanks for this constructive suggestion. This information has been presented in the revised Supplementary Table 3 and in a newly added supplementary figure (Fig. S1B), which together provide a clearer overview of ncORF detection consistency and context specificity.

      (2) The authors' validation using MS data lacks specific details in the paper. Regarding the MS-supported ORFs mentioned in line 117, which dataset's MS data is being referenced? Or does it refer to the content in Reference 20? At present, substantial research exists in both public general proteomics studies (e.g., CPTAC) and MS investigations targeting non-canonical ORFs. We recommend the authors incorporate additional MS data or public MS-based databases to strengthen validation in this area (PMID: 34129944, 39794466, 37823596, 39413795).

      We thank the reviewer for this comment and for the helpful suggestions. The MS-supported ORFs mentioned in line 117 refer to the compilation reported in Reference 20, which integrates evidence from multiple independent proteomics studies. In addition, we examined MS-supported ORFs curated by GENCODE and PeptideAtlas, which are shown in Fig. 1E.

      We agree that incorporating additional MS datasets would further strengthen validation of ncORFs. Studies cited by the reviewer and recent community efforts such as the GENCODE and PeptideAtlas analyses (PMID: 39314370) provide valuable examples in this direction. However, performing a comprehensive reanalysis of more than 95,000 public human MS runs is computationally demanding and currently infeasible for our group given resource and funding constraints.

      To our knowledge, ongoing community-wide initiatives are working toward more comprehensive catalogs of translated human ncORFs. Large-scale, exhaustive MS searches will be particularly effective once a community consensus annotation framework for ncORFs is established. We have added discussion of these limitations and future directions in the revised manuscript.

      (3) The authors classified ncORFs into three groups ("Ancient," "Young," and "Old") based on their origin nodes. However, both the "Young" and "Old" groups appear to be "mammalian-specific," and the specific criteria separating them remain unclear. We recommend defining more clearly, in the figure legend or main text, how "Young" and "Old" are categorized (e.g., based on specific evolutionary nodes or distance thresholds from nodes to the end) to avoid reader confusion.

      In Fig. 5, “old” and “young” were intended as qualitative descriptors of relative evolutionary age based on the position of ncORF origination nodes along the phylogeny, as indicated on the x-axis. They were not meant to represent discrete categories. To avoid confusion, we have revised the manuscript to use “older” and “younger” throughout when referring to relative age differences. A binary classification is used only in Fig. 6E, where ncORFs are grouped into ancient (pre-mammalian) and younger (mammalian-specific) categories. This distinction is clearly defined in both the main text and the corresponding figure legend.

      (4) The authors observed an intriguing phenomenon: ncORFs on mRNA tend to co-translate with the main CDS of the same gene. However, the conventional view holds that uORF translation often inhibits the translation of the main CDS. I suggest the authors could refine their analysis in this section further. For instance, do different types of ORFs or ORFs at different evolutionary levels exhibit distinct levels of cotranslation with the main CDS? Additionally, while observing this phenomenon, the authors should also propose hypotheses regarding the regulatory mechanisms involved in these processes.

      We thank the reviewer for these constructive suggestions. After excluding CDS-overlapping ORFs, we identified 258 human and 128 mouse ncORFs that co-translate with their corresponding main CDSs. With the exception of 10 human dORFs, all remaining cases were uORFs. We compared these co-translating ncORFs with other non-overlapping uORFs and dORFs but did not detect statistically significant differences in evolutionary age or conservation metrics. Because no clear distinguishing features emerged, we did not include these results in the manuscript.

      Importantly, the observation of uORF–CDS co-translation does not contradict the established repressive role of uORFs. Co-translation reflects concurrent ribosome occupancy, whereas repression concerns the proportion of initiating ribosomes that ultimately translate the CDS. For example, if two ribosomes initiate within a given interval and one translates the uORF while one translates the CDS, CDS output is reduced by 50% relative to a uORF-free transcript. If four ribosomes initiate under the same repressive regime, two may translate the uORF and two the CDS. In this case, absolute translation of both ORFs increases, while the fractional repression remains unchanged. Thus, co-translation is compatible with a regulatory model in which uORFs reduce CDS translation efficiency without abolishing it. This has been clarified in the revised manuscript.
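
      The arithmetic in this example can be checked with a minimal sketch. It assumes, as the example implies, that every initiating ribosome on a uORF-free transcript translates the CDS; the function name `translation_outputs` and the parameter `uorf_fraction` are hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def translation_outputs(initiating_ribosomes, uorf_fraction):
    """Split initiating ribosomes between the uORF and the CDS.

    uorf_fraction is a hypothetical parameter: the share of initiating
    ribosomes that translate the uORF instead of the CDS.
    """
    uorf = initiating_ribosomes * uorf_fraction
    cds = initiating_ribosomes - uorf
    # Assumption: on a uORF-free transcript, all initiating ribosomes
    # translate the CDS, so that is the reference CDS output.
    repression = 1 - cds / initiating_ribosomes
    return uorf, cds, repression


# Same repressive regime (half of the ribosomes go to the uORF),
# with two versus four initiating ribosomes per interval.
low = translation_outputs(2, Fraction(1, 2))
high = translation_outputs(4, Fraction(1, 2))

print("2 initiating ribosomes:", low)
print("4 initiating ribosomes:", high)
```

      In both cases the fractional repression is one half, while the absolute uORF and CDS outputs double when initiation doubles, which is the pattern described as co-translation.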

    1. eLife Assessment

      This study offers an important contribution to our understanding of the role of layer 6b cortical neurons in sleep-wake regulation, providing new insight into how this understudied neural population may regulate cortical arousal via orexin signaling. The evidence supporting these findings is solid, although somewhat constrained by limitations in the specificity of the genetic targeting strategy. Nonetheless, the work introduces new avenues for uncovering how the classical wake-promoting peptide, orexin, exerts its effects on the cortex.

    2. Reviewer #1 (Public review):

      Summary:

      Meijer et al. sought to investigate the role of cortical layer 6b (L6b) neurons in modulating sleep-wake states and cortical oscillations under baseline and sleep-deprived conditions and in response to orexin A and B. Using chronic EEG recordings in mice with silencing of Drd1a+ neurons (via constitutive Cre-dependent knockout of SNAP25), the authors report that while overall baseline sleep-wake architecture and the response to sleep deprivation are minimally changed or unchanged, "L6b silencing" leads to a slowing of theta activity during wakefulness and REM sleep, and a reduction in EEG power during NREM sleep. The manuscript is written with clarity and transparency. Although Drd1a+ neurons are not exclusive to L6b, the authors describe key future studies to identify a causal role for L6b neurons in brain state regulation. These studies contribute to a growing body of evidence that the cortex, in addition to subcortical brain regions, plays a role in brain state regulation.

      Strengths:

      (1) The text is well written.

      (2) The authors are transparent about methodological details and study limitations.

      (3) The stated sleep, circadian, and orexin infusion experiments are well designed, executed, and analyzed.

      Weaknesses:

      (1) Outcomes are attributed to silencing cortical L6b neurons, but the genetic manipulation is not specific to L6b neurons or cortex. The authors acknowledge this as a limitation and offer targets for future studies to identify L6b neuron-specific contributions to stated outcomes that include spatially restricted manipulations.

      (2) Experiments use only male mice, which limits generalizability to females.

      Comments on revised version:

      The authors took great care in addressing my previous comments, and I do not have any additional concerns.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Meijer and colleagues investigated the effects of inactivation (conditional silencing) of cortical layer 6b neurons on sleep-wake states and EEG spectral power under three conditions: during natural sleep-wake states, after sleep deprivation, and after intracerebroventricular administration of orexin A and B. The authors report that silencing of L6b neurons did not have a significant effect on the total time spent in sleep-wake states, the duration or number of state epochs, or the response to sleep deprivation. However, silencing of L6b neurons did slow theta-frequency activity (6-9 Hz) during wake and REM sleep, and reduced the total EEG power during NREM sleep. Infusion of orexin A in the mice in which cortical layer 6b neurons were inactivated produced an increase in wakefulness. A similar effect was observed after infusion of orexin A in the mice in which these neurons were not silenced, but the increase in wakefulness was of a smaller magnitude. Silencing of cortical layer 6b neurons attenuated the orexin B-induced increase in theta activity that was observed in the control mice. The authors conclude that the cortical neurons in layer 6b play an essential role in state-dependent dynamics of brain activity, vigilance state control, and sleep regulation.

      Strengths:

      - A focus on cortical layer 6b neurons, which is an understudied neuronal population, especially in the context of brain and behavioral state transitions.

      - The authors used a well-established mouse model to study the effect of inactivation of cortical layer 6b neurons.

      Weaknesses:

      - Although the authors used a highly selective approach to silence layer 6b neurons, the observed changes in EEG oscillations cannot be solely attributed to layer 6b neurons because of the ICV route for orexin administration.

      - The rationale for using only male mice is not provided.

      Comments on revised version:

      The authors have addressed my concerns.