eLife Assessment
This is a valuable study of the physiological mechanisms promoting network activity during fever in the mouse neocortex. The supporting evidence is solid, and has improved with revision, along with increased clarity of presentation.
Reviewer #1 (Public review):
The paper by Chen et al. describes the role of neuronal thermo-TRPV3 channels in the firing of cortical neurons in the fever temperature range. The authors began by demonstrating that exposure to infrared light, which raises ambient temperature, causes body temperature to rise to fever level above 38°C. Subsequently, they showed that at the fever temperature of 39°C, the spike threshold (ST) increased in both populations (P12-14 and P7-8) of cortical excitatory pyramidal neurons (PNs). However, the spike number only decreased in P7-8 PNs, while it remained stable in P12-14 PNs at 39°C. In addition, the fever temperature also reduced the late peak postsynaptic potential (PSP) in P12-14 PNs. The authors further characterized the firing properties of cortical P12-14 PNs, identifying two types: STAY PNs that retained spiking at 30°C, 36°C and 39°C, and STOP PNs that stopped spiking upon temperature change. They further extended their analysis and characterization to striatal medium spiny neurons (MSNs) and found that STAY MSNs and PNs shared the same ST temperature sensitivity. Using small molecule tools, they further identified that thermo-TRPV3 currents, but not TRPV4 currents, in cortical PNs increased in response to temperature elevation. The authors concluded that during fever, neuronal firing stability is largely maintained by sensory STAY PNs and MSNs that express functional TRPV3 channels. Overall, this study is well designed and executed with substantial controls, some interesting findings and quality of data.
Comments on revisions:
My previous concerns have been addressed in this revised manuscript.
Reviewer #2 (Public review):
Summary:
The authors studied the excitability of layer 2/3 pyramidal neurons in response to layer four stimulation at temperatures ranging from 30 to 39°C in P7-8, P12-P14, and P22-P24 animals. They also measured brain temperature and spiking in vivo in response to externally applied heat. Some pyramidal neurons continue to fire action potentials in response to stimulation at 39°C and are referred to as "stay neurons." Stay neurons have unique properties, aided by the expression of the TRPV3 channel.
Strengths:
The authors focused on layer 2/3 neuronal excitability at three developmental stages: during the window of susceptibility to febrile seizures, before the window opens, and after it closes.
Electrophysiological experiments are rigorously performed and carefully interpreted.
The cellular electrophysiology is further confirmed. The authors compared the seizure susceptibility of TRPV3 knockout, heterozygous, and wild-type mice. EEG recordings would have strengthened the study, but they are challenging in this age group.
Finally, the authors studied TRPV3 expression with immunohistochemistry.
Reviewer #3 (Public review):
Summary:
This important study combines in vitro and in vivo recording to determine how the firing of cortical and striatal neurons changes during a fever range temperature rise (37-40 °C). The authors found that certain neurons will start, stop, or maintain firing during these body temperature changes. The authors further suggested that the TRPV3 channel plays a role in maintaining cortical activity during fever.
Strengths:
The topic of how the firing pattern of neurons changes during fever is unique and interesting. The authors carefully used in vitro electrophysiology assays to study this interesting topic.
Weaknesses:
(1) In vivo recording is a strength of this study. However, data from in vivo recording is only shown in Fig 5A,B. This reviewer suggests the authors further expand on the analysis of the in vivo Neuropixels recording. For example, to show single spike waveforms and raster plots to provide more information on the recording. The authors can also separate the recording based on brain regions (cortex vs striatum) using the depth of the probe as a landmark to study the specific firing of cortical neurons and striatal neurons. It is also possible to use published parameters to separate the recording based on spike waveform to identify regular principal neurons vs fast-spiking interneurons. Since the authors studied E/I balance in brain slices, it would be very interesting to see whether the "E/I balance" based on the firing of excitatory neurons vs fast-spiking interneurons might be changed or not in the in vivo condition.
(2) The author should propose a potential mechanism for how TRPV3 helps to maintain cortical activity during fever. Would calcium influx-mediated change of membrane potential be the possible reason? Making a summary figure to put all the findings into perspective and propose a possible mechanism would also be appreciated.
(3) The author studied P7-8, P12-14, and P20-26 mice. How do these ages correspond to human ages? It would be nice to provide a comparison to help the reader understand the context better.
Comments on revisions:
In this revised version, the authors nicely addressed my critiques. I have no more comments to make.
Author response:
The following is the authors’ response to the original reviews
Public Reviews:
Reviewer #1 (Public review):
The paper by Chen et al. describes the role of neuronal thermo-TRPV3 channels in the firing of cortical neurons in the fever temperature range. The authors began by demonstrating that exposure to infrared light, which raises ambient temperature, causes body temperature to rise to a fever level above 38°C. Subsequently, they showed that at the fever temperature of 39°C, the spike threshold (ST) increased in both populations (P12-14 and P7-8) of cortical excitatory pyramidal neurons (PNs). However, the spike number only decreased in P7-8 PNs, while it remained stable in P12-14 PNs at 39°C. In addition, the fever temperature also reduced the late peak postsynaptic potential (PSP) in P12-14 PNs. The authors further characterized the firing properties of cortical P12-14 PNs, identifying two types: STAY PNs that retained spiking at 30°C, 36°C, and 39°C, and STOP PNs that stopped spiking upon temperature change. They further extended their analysis and characterization to striatal medium spiny neurons (MSNs) and found that STAY MSNs and PNs shared the same ST temperature sensitivity. Using small molecule tools, they further identified that thermo-TRPV3 currents, but not TRPV4 currents, in cortical PNs increased in response to temperature elevation. The authors concluded that during fever, neuronal firing stability is largely maintained by sensory STAY PNs and MSNs that express functional TRPV3 channels. Overall, this study is well designed and executed with substantial controls, some interesting findings, and quality of data. Here are some specific comments:
(1) Could the authors discuss, or is there any evidence of, changes in TRPV3 expression levels in the brain during the postnatal 1-4 week age range in mice?
This is an excellent question. To our knowledge, no published studies have documented changes in TRPV3 expression in the mouse brain during the first to fourth postnatal weeks. Research on TRPV3 expression has primarily relied on RT-PCR analysis of RNA from dissociated adult brain tissue (Jang et al., 2012; Kumar et al., 2018), largely due to the limited availability of effective antibodies for brain sections at the time. Furthermore, the Allen Brain Atlas does not provide data on TRPV3 expression in the developing or postnatal brain. To address this gap, we performed immunohistochemistry to examine TRPV3 expression at P7, P14, and P21 (Figure 7). To confirm specificity, the TRPV3 antibody was co-incubated with a TRPV3 blocker (Figure 7A, top row, right panel). While immunohistochemistry is semiquantitative, we observed a trend toward increased TRPV3 expression in the cortex, striatum, hippocampus, and thalamus from P7 to P14.
(2) Are there any differential differences in TRPV3 expression patterns that could explain the different firing properties in response to fever temperature between the STAY- and STOP neurons?
This is another excellent question, and we plan to explore it in the future by developing reporter mice for TRPV3 expression and viral tools that leverage endogenous TRPV3 promoters to drive a fluorescent protein, enabling monitoring of cells with native TRPV3 expression. To our knowledge, such tools do not currently exist. Creating them will be challenging, as it requires identifying promoters that accurately reflect endogenous TRPV3 expression.
We have not yet quantified TRPV3 expression in STOP and STAY neurons. However, our analysis of evoked spiking at 30, 36, and 39 °C suggests that TRPV3 may mark a population of cortical pyramidal neurons that tend to remain active (“STAY”) as temperatures increase. While we have not directly compared TRPV3 expression between STAY and STOP neurons at fever-range temperatures, intracellular blockade of TRPV3 with forsythoside B (50 µM) significantly reduced the proportion of STAY neurons (Figure 9B). Consistently, spiking was also significantly reduced in Trpv3⁻/⁻ mice (Figure 10D).
In our immunohistochemical analysis, TRPV3 was detected in L4 barrels and in L2/3, where we observed a patchy distribution with some regions showing more intense staining (Figure 7B). It is possible that cells with higher TRPV3 levels correspond to STAY neurons, while those with lower levels correspond to STOP neurons. As we develop tools to monitor activity based on endogenous TRPV3 levels, we anticipate gaining deeper insight into this relationship.
(3) TRPV3 and TRPV4 can co-assemble to form heterotetrameric channels with distinct functional properties. Do STOP neurons exhibit any firing behaviors that could be attributed to the variable TRPV3/4 assembly ratio?
There is some evidence that TRPV3 and TRPV4 proteins can physically associate in HEK293 cells and native skin tissues (Hu et al., 2022). TRPV3 and TRPV4 are both expressed in the cortex (Kumar et al., 2018), but it remains unclear whether they are co-expressed and co-assembled to form heteromeric channels in cortical excitatory pyramidal neurons. Examination of the I-V curve from HEK cells co-expressing TRPV3/4 heteromeric channels shows enhanced current at negative membrane potentials (Hu et al., 2022).
Currently, we cannot characterize cells as STOP or STAY and measure TRPV3 or TRPV4 currents simultaneously, as this would require different experimental setups and internal solutions. Additionally, the protocol involves a sequence of recordings at 30, 36, and 39°C, followed by cooling back to 30°C and re-heating to each temperature. Cells undergoing such a protocol are unlikely to survive until the end.
In our recordings of TRPV3 currents, which likely include both STOP and STAY cells, we do not observe a significant current at negative voltages, suggesting that TRPV3/4 heteromeric channels may either be absent or underrepresented, at least at a 1:1 ratio. However, the possibility that TRPV3/4 heteromeric channels could define the STOP cell population is intriguing and plausible.
(4) In Figure 7, have the authors observed an increase of TRPV3 currents in MSNs in response to temperature elevation?
We have not recorded TRPV3 currents in MSNs in response to elevated temperatures. Please note that the handling editor gave us the option to remove these data from the paper, and we elected to do so to develop them as a separate manuscript.
(5) Is there any evidence of a relationship between TRPV3 expression levels in D2+ MSNs and degeneration of dopamine-producing neurons?
This is an interesting question, though it falls outside our current research focus in the lab. A PubMed search yields no results connecting the terms TRPV3, MSNs, and degeneration. However, gain-of-function mutations in TRPV4 channel activity have been implicated in motor neuron degeneration (Sullivan et al., 2024) and axon degeneration (Woolums et al., 2020). Similarly, TRPV1 activation has been linked to developmental axon degeneration (Johnstone et al., 2019), while TRPV3 blockade has shown neuroprotective effects in models of cerebral ischemia/reperfusion injury in mice (Chen et al., 2022).
The link between TRPV activation and cell degeneration, however, may not be straightforward. For instance, TRPV1 loss has been shown to accelerate stress-induced degradation of axonal transport from retinal ganglion cells to the superior colliculus and to cause degeneration of axons in the optic nerve (Ward et al., 2014). Meanwhile, TRPV1 activation by capsaicin preserves the survival and function of nigrostriatal dopamine neurons in the MPTP mouse model of Parkinson's disease (Chung et al., 2017).
(6) Does fever range temperature alter the expressions of other neuronal Kv channels known to regulate the firing threshold?
This is an active line of investigation in our lab. The results of ongoing experiments will provide further insight into this question.
Reviewer #2 (Public review):
Summary:
The authors study the excitability of layer 2/3 pyramidal neurons in response to layer four stimulation at temperatures ranging from 30 to 39 Celsius in P7-8, P12-P14, and P22-P24 animals. They also measure brain temperature and spiking in vivo in response to externally applied heat. Some pyramidal neurons continue to fire action potentials in response to stimulation at 39 C and are called stay neurons. Stay neurons have unique properties aided by TRPV3 channel expression.
Strengths:
The authors use various techniques and assemble large amounts of data.
Weaknesses:
(1) No hyperthermia-induced seizures were recorded in the study.
The goal of this manuscript is to uncover age-related physiological changes that enable the brain to maintain function at fever-range temperatures, typically 38–40°C. Febrile seizures in humans are also typically induced within this temperature range. Given this context, we initially did not examine hyperthermia-induced seizures. However, as requested, we assessed the effects of reduced Trpv3 expression on hyperthermia-induced seizures in WT (Trpv3<sup>+/+</sup>), heterozygous (Trpv3<sup>+/-</sup>), and homozygous knockout (Trpv3<sup>-/-</sup>) P12 pups. Please see Figure 10.
While T<sub>b</sub> at seizure onset and the rate of T<sub>b</sub> increase leading to seizure were not significantly different among genotypes, the time to seizure from the point of loss of postural control (LPC), defined as collapse and failure to maintain upright posture, was significantly longer in Trpv3<sup>+/-</sup> and Trpv3<sup>-/-</sup> mice. Together, these results indicate that reduced TRPV3 function enhances resistance to seizure initiation and/or propagation under febrile conditions, likely by decreasing neuronal depolarization and excitability.
(2) Febrile seizures in humans are age-specific, extending from 6 months to 6 years. While translating to rodents is challenging, according to published literature (see Baram), rodents aged P11-16 experience seizures upon exposure to hyperthermia. The rationale for publishing data on P7-8 and P22-24 animals, which are outside this age window, must be clearly explained to address a potential weakness in the study.
As requested, we have added an explanation in the “Introduction” for our rationale in including age ranges that flank the period of susceptibility to hyperthermia-induced seizures (see lines 80–100). In summary, we emphasize that this design provides negative controls, allowing us to determine whether the changes observed in the P12–14 window are specific to this developmental period.
(3) Authors evoked responses from layer 4 and recorded postsynaptic potentials, which then caused action potentials in layer 2/3 neurons in the current clamp. The post-synaptic potentials are exquisitely temperature-sensitive, as the authors demonstrate in Figures 3 B and 7D. Note markedly altered decay of synaptic potentials with rising temperature in these traces. The altered decays will likely change the activation and inactivation of voltage-gated ion channels, adjusting the action potential threshold.
The activation and inactivation of voltage-gated ion channels can modulate action potential threshold. Indeed, we have identified channels that contribute to the temperature-induced increase in spike threshold, including BK channels and Scn2a. However, Figure 4B represents a cell with no inhibition at 39°C, hence the observed loss of the late postsynaptic potential (PSP). This loss is the primary contributor to the prolonged decay of the synaptic potentials. By contrast, cells in which inhibition is retained, when exposed to the same thermal protocol, do not exhibit such extended decay.
(4) The data weakly supports the claim that the E-I balance is unchanged at higher temperatures. Synaptic transmission is exquisitely temperature-sensitive due to the many proteins and enzymes involved. A comprehensive analysis of spontaneous synaptic current amplitude, decay, and frequency is crucial to fully understand the effects of temperature on synaptic transmission.
We did not intend to imply that E-I balance is generally unchanged at higher temperatures. Our statements specifically referred to observations in experiments conducted during the P20–26 age range in cortical pyramidal neurons. We are conducting a parallel line of investigation examining the differential susceptibility of E-I balance across age and temperature, and we have observed age- and temperature-dependent effects. Recognizing that our earlier wording may have been misleading, we have removed this statement from the manuscript.
(5) It is unclear how the temperature sensitivity of medium spiny neurons is relevant to febrile seizures. Furthermore, the most relevant neurons are hippocampal neurons since the best evidence from human and rodent studies is that febrile seizures involve the hippocampus.
Thank you for the opportunity to provide clarification. The goal of this manuscript is to uncover age-related physiological changes that enable the brain to maintain stable, non-excessive neuronal firing at fever-range temperatures (typically 38–40°C). We hypothesize that these changes are a normal part of brain development, potentially explaining why most children do not experience febrile seizures. By understanding these mechanisms, we may identify points in the process that are susceptible to dysfunction, due to genetic mutations, developmental delays, or environmental factors, which could provide insight into the rare cases when seizures occur between 2–5 years of age.
Our aim was not to establish a link between medium spiny neuron (MSN) function and febrile seizures. MSNs were included in this study as a mechanistic comparison because they represent a non-pyramidal, non-excitatory neuronal subtype, allowing us to assess whether the physiological changes observed in L2/3 excitatory pyramidal neurons are unique to these cells. Please note that the handling editor gave us the option to remove these data from the manuscript, and we chose to do so, developing these findings into a separate manuscript.
(6) TRPV3 data would be convincing if the knockout animals did not have febrile seizures.
We find that approximately equal numbers of excitatory neurons either start or stop firing at fever-range temperatures (typically 38–40 °C). Neurons that continue to fire (“STAY” cells) thus play a key role in maintaining stable, non-excessive network activity. While future studies will examine the mechanisms driving some neurons to initiate spiking, our findings suggest that a reduction in the number of STAY cells could influence more subtle aspects of seizure dynamics, such as time to onset, by decreasing overall network excitability. We assessed the effects of reduced Trpv3 expression on hyperthermia-induced seizures in WT (Trpv3<sup>+/+</sup>), heterozygous (Trpv3<sup>+/-</sup>), and homozygous knockout (Trpv3<sup>-/-</sup>) P12 pups. As you stated, these mice have hyperthermic seizures. However, we noted that the time to seizure from the point of loss of postural control (LPC), defined as collapse and failure to maintain upright posture, was significantly longer in Trpv3<sup>+/-</sup> and Trpv3<sup>-/-</sup> mice. Normally, seizures happen shortly after this point, but notably, Trpv3<sup>-/-</sup> mice took twice as long to reach seizure onset compared with wildtype mice. In an epileptic patient, this increased time may be sufficient for a caretaker to move the patient to a safer location, reducing the risk of injury during the seizure.
Consistent with findings that TRPV3 blockade using 50 µM forsythoside B reduces spiking in cortical L2/3 pyramidal neurons, we observed significantly reduced spiking in Trpv3<sup>-/-</sup> mice as well (Figure 10D). Analysis of postsynaptic potentials in these neurons showed that, in WT mice, PSP amplitude increased with temperature elevation into the febrile range, whereas this temperature-dependent depolarization was absent in Trpv3<sup>-/-</sup> mice (Figure 10E). Together, these results indicate that reduced TRPV3 function enhances resistance to seizure initiation and/or propagation under febrile conditions, likely by decreasing neuronal depolarization and excitability.
Reviewer #3 (Public review):
Summary:
This important study combines in vitro and in vivo recording to determine how the firing of cortical and striatal neurons changes during a fever range temperature rise (37-40 °C). The authors found that certain neurons will start, stop, or maintain firing during these body temperature changes. The authors further suggested that the TRPV3 channel plays a role in maintaining cortical activity during fever.
Strengths:
The topic of how the firing pattern of neurons changes during fever is unique and interesting. The authors carefully used in vitro electrophysiology assays to study this interesting topic.
Weaknesses:
(1) In vivo recording is a strength of this study. However, data from in vivo recording is only shown in Figures 5A,B. This reviewer suggests the authors further expand on the analysis of the in vivo Neuropixels recording. For example, to show single spike waveforms and raster plots to provide more information on the recording. The authors can also separate the recording based on brain regions (cortex vs striatum) using the depth of the probe as a landmark to study the specific firing of cortical neurons and striatal neurons. It is also possible to use published parameters to separate the recording based on spike waveform to identify regular principal neurons vs fast-spiking interneurons. Since the authors studied E/I balance in brain slices, it would be very interesting to see whether the "E/I balance" based on the firing of excitatory neurons vs fast-spiking interneurons might be changed or not in the in vivo condition.
As requested, we have included additional analyses and figures related to the in vivo recording experiments in Figure 5. Specifically, we added examples of multiunit and single-spike waveforms, as well as autocorrelation histograms (ACHs). ACHs were used because raster plots of individual single units would not be very informative given the long recording period. Figure 5F is also intended to stand in for raster plots, as it helps to track changes in the firing rate of a single neuron over time.
Additionally, all recordings were conducted in the cortex at a depth of ~1 mm from the surface, and no recordings were performed in the striatum. Based on the reviewing editor’s suggestions, we decided to remove the striatal data from the manuscript and develop this aspect of the project for a separate publication.
Lastly, we used published parameters to classify recordings based on spike waveform into putative regular principal neurons and interneurons. To clarify this point, we have now included descriptions that were previously listed only in the “Methods” section into the “Results” section as well.
The paragraph below from the methods section describes this procedure.
“Following manual curation, based on their spike waveform duration, the selected single units (n = 633) were separated into putative inhibitory interneurons and excitatory principal cells (Barthó et al., 2004). The spike duration was calculated as the time difference between the trough and the subsequent waveform peak of the mean filtered (300 – 6000 Hz bandpassed) spike waveform. Durations of extracellularly recorded spikes showed a bimodal distribution (Hartigan’s dip test; p < 0.001) characteristic of the neocortex with shorter durations corresponding to putative interneurons (narrow spikes) and longer durations to putative principal cells (wide spikes). Next, k-means clustering was used to separate the single units into these two groups, which resulted in 140 interneurons (spike duration < 0.6 ms) and 493 principal cells (spike duration > 0.6 ms), corresponding to a typical 22% - 78% (interneuron – principal) cell ratio”.
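The classification procedure quoted above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' analysis code: the function names (`trough_to_peak_ms`, `two_means_1d`), the 30 kHz sampling rate, and the toy waveform and durations are all hypothetical, and a plain two-cluster k-means on scalar durations stands in for whatever clustering implementation was actually used. Bandpass filtering and the dip test are omitted.

```python
def trough_to_peak_ms(waveform, sampling_rate_hz):
    """Time from the waveform trough to the subsequent peak, in ms."""
    # Index of the most negative sample (the spike trough).
    trough = min(range(len(waveform)), key=lambda i: waveform[i])
    # Index of the largest sample at or after the trough.
    peak = max(range(trough, len(waveform)), key=lambda i: waveform[i])
    return (peak - trough) * 1000.0 / sampling_rate_hz


def two_means_1d(values, n_iter=100):
    """Two-cluster k-means on scalars.

    Returns (labels, centers); label 0 is the short-duration (narrow) cluster.
    Centers are initialised at the minimum and maximum of the data.
    """
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assign each value to the nearer of the two centers.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        narrow = [v for v, lab in zip(values, labels) if lab == 0]
        wide = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(narrow) / len(narrow)
        new_hi = sum(wide) / len(wide) if wide else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)


# Toy mean waveform sampled at an assumed 30 kHz (values are made up).
example_waveform = [0, -1, -5, -2, 1, 3, 2, 0]
example_duration = trough_to_peak_ms(example_waveform, 30000)

# Made-up spike durations in ms, not data from the study.
durations_ms = [0.30, 0.35, 0.40, 0.80, 0.85, 0.90, 0.95]
labels, centers = two_means_1d(durations_ms)
putative_interneurons = [d for d, lab in zip(durations_ms, labels) if lab == 0]
putative_principal = [d for d, lab in zip(durations_ms, labels) if lab == 1]
```

With real data, the boundary between the two clusters would be expected to fall near the 0.6 ms value reported in the quoted paragraph, but that depends on the recorded distribution.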
As suggested, we calculated the E/I balance using the average firing rates of excitatory and inhibitory neurons in the in vivo condition. Our analysis revealed that the E/I balance remained unchanged (see Author response image 1). Nonetheless, following the option provided by the reviewing editor, we have chosen to remove the statement referencing E/I balance from the manuscript.
Author response image 1.
(2) The author should propose a potential mechanism for how TRPV3 helps to maintain cortical activity during fever. Would calcium influx-mediated change of membrane potential be the possible reason? Making a summary figure to put all the findings into perspective and propose a possible mechanism would also be appreciated.
Thank you for your helpful suggestion. In response, we have included a summary figure (Figure 11) illustrating the hypothesis described in the Discussion section. We agree with your assessment that Trpv3 most likely contributes to maintaining cortical activity during fever by promoting calcium influx and depolarizing the membrane potential.
(3) The author studied P7-8, P12-14, and P20-26 mice. How do these ages correspond to human ages? It would be nice to provide a comparison to help the reader understand the context better.
Ideally, the mouse to human age comparison should depend on the specific process being studied. Per your suggestion, we have added additional references in the Introduction (Dobbing and Sands, 1973; Baram et al., 1997; Bender et al., 2004) to help readers better understand the correspondence between mouse and human ages.
Recommendations for the authors:
Reviewer #2 (Recommendations for the authors):
(3) Perform I-F curves to study the intrinsic properties of layer 2/3 neurons without the confound of evoked responses.
We performed F-I curve analyses (Figures 2H–I), as suggested by Reviewer 2, to study intrinsic properties of L2/3 neurons without evoked responses. Although rheobase increased at 39 °C compared to 30 °C, consistent with findings such as depolarized spike threshold and reduced input resistance, the mean number of spikes across current steps did not differ.
Reviewer #3 (Recommendations for the authors):
Some statistical descriptions are not clearly stated. For example, what statistical methods were used in Fig 2E? The effect size in Fig 2D seems to be quite small. The authors are advised to consider "nested analysis" to further increase the rigor of the analysis. Does each dot mean one neuron? Some of the data points might not be totally independent. The author should carefully check all figures to make sure the stats methods are provided for each panel.
We apologize for not including statistical details in Figure 2E. We have now added this information and verified that statistical descriptions are provided in all figure legends. In Figure 2D, each dot represents a cell, with measurements taken from the same cell at 30°C, 36°C, and 39°C. Given this design, the appropriate test is a one-way repeated-measures ANOVA.
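For readers less familiar with the test named above, the F statistic of a one-way repeated-measures ANOVA can be computed as in the following sketch. It is a minimal illustration, not the authors' analysis: the function name `rm_anova_f` is hypothetical, the numbers are made up (arbitrary units, one row per cell and one column per temperature), and the p-value, which requires the F distribution from a statistics library, and any sphericity correction are left out.

```python
def rm_anova_f(data):
    """One-way repeated-measures ANOVA F statistic.

    data: list of rows, one row per subject (cell), one column per condition.
    Returns (F, df_condition, df_error).
    """
    n = len(data)        # number of subjects (cells)
    k = len(data[0])     # number of repeated conditions (temperatures)
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    # Removing between-subject variability is what distinguishes the
    # repeated-measures design from an ordinary one-way ANOVA.
    ss_err = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_err = (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_err / df_err)
    return f_stat, df_cond, df_err


# Made-up measurements: 3 cells, each recorded at 3 temperatures.
toy = [[1, 2, 3],
       [2, 4, 6],
       [3, 6, 9]]
f_stat, df_cond, df_err = rm_anova_f(toy)
```

Because each cell contributes a value at every temperature, subject-level differences are subtracted from the error term, which is why the paired design calls for this test rather than an independent-groups ANOVA.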
eLife Assessment
This study provides a useful investigation of human-AI interaction and decision-making, using both behavioral and electrophysiological measures. However, the theoretical framework and experimental design are incomplete, with an unclear task structure and feedback implementation limiting interpretability. With these issues addressed, the work could make a significant contribution to understanding human-AI collaboration.
Reviewer #1 (Public review):
Summary:
In the study by Roeder and colleagues, the authors aim to identify the psychophysiological markers of trust during the evaluation of matching or mismatching AI decision-making. Specifically, they aim to characterize through brain activity how the decision made by an AI can be monitored throughout time in a two-step decision-making task. The objective of this study is to unfold, through continuous brain activity recording, the general information processing sequence while interacting with an artificial agent, and how internal as well as external information interact and modify this processing. Additionally, the authors provide a subset of factors affecting this information processing for both decisions.
Strengths:
The study addresses a wide and important topic of the value attributed to AI decisions and their impact on our own confidence in decision-making. It especially questions some of the factors modulating the dynamical adaptation of trust in AI decisions. Factors such as perceived reliability, type of image, mismatch, or participants' bias toward one response or the other are very relevant to the question in human-AI interactions.
Interestingly, the authors also question the processing of more ambiguous stimuli, with no real ground truth. This gets closer to everyday life situations where people have to make decisions in uncertain environments. Having a better understanding of how those decisions are made is very relevant in many domains.
Also, the method for processing behavioral and especially EEG data is overall very robust and is what is currently recommended for statistical analyses for group studies. Additionally, authors provide complete figures with all robustness evaluation information. The results and statistics are very detailed. This promotes confidence, but also replicability of results.
An additional interesting method aspect is that it is addressing a large window of analysis and the interaction between three timeframes (evidence accumulation pre-decision, decision-making, post-AI decision processing) within the same trials. This type of analysis is quite innovative in the sense that it is not yet a standard in complex experimental designs. It moves forward from classical short-time windows and baseline ERP analysis.
Weaknesses:
This manuscript raises several conceptual and theoretical considerations that are not necessarily answered by the methods (especially the task) used. Even though the authors propose to assess trust dynamics and violations in cooperative human-AI teaming decision-making, I don't believe their task resolves such a question. Indeed, there is no direct link between the human decision and the AI decision. They do not cooperate per se, and the AI decision doesn't seem, from what I understood, to have an impact on the participants' decision making. The authors make several assumptions regarding trust, feedback, response expectation, and "classification" (i.e., match vs. mismatch) which seem far-fetched when considering the scientific literature on these topics.
Unlike what is done for the data processing, the authors have not managed to take the big picture of the theoretical implications of their results. A big part of this study's interpretation aims to have their results fit into the theoretical box of the neural markers of performance monitoring.
Overall, the analysis method was very robust and well-managed, but the experimental task they have set up does not support their claim. Here, they seem to be assessing the impact of a mismatch between two independent decisions.
Nevertheless, this type of work is very important to various communities. First, it addresses topical concerns associated with the introduction of AI in our daily life and decisions, but it also addresses methodological difficulties that the EEG community has been having to move slowly away from the static event-based short-timeframe analyses onto a more dynamic evaluation of the unfolding of cognitive processes and their interactions. The topic of trust toward AI in cooperative decision making has also been raised by many communities, and understanding the dynamics of trust, as well as the factors modulating it, is of concern to many high-risk environments, or even everyday life contexts. Policy makers are especially interested in this kind of research output.
Reviewer #2 (Public review):
Summary:
The authors investigated how "AI-agent" feedback is perceived in an ambiguous classification task, and categorised the neural responses to this. They asked participants to classify real or fake faces, and presented an AI-agent's feedback afterwards, where the AI-feedback disagreed with the participants' response on a random 25% of trials (called mismatches). Pre-response ERP was sensitive to participants' classification as real or fake, while ERPs after the AI-feedback were sensitive to AI-mismatches, with stronger N2 and P3a&b components. There was an interaction of these effects, with mismatches after a "Fake" response affecting the N2 and those after "Real" responses affecting P3a&b. The ERPs were also sensitive to the participants' response biases, and their subjective ratings of the AI agent's reliability.
Strengths:
The researchers address an interesting question, and extend the AI-feedback paradigm to ambiguous tasks without veridical feedback, which is closer to many real-world tasks. The in-depth analysis of ERPs provides a detailed categorisation of several ERPs, as well as whole-brain responses, to AI-feedback, and how this interacts with internal beliefs, response biases, and trust in the AI-agent.
Weaknesses:
There is little discussion of how the poor performance (close to 50% chance) may have affected performance on the task, such as by leading to entirely random guessing or overreliance on response biases. This can change how error-monitoring signals are presented, as they are affected by participants' accuracy, as well as affecting how the AI feedback is perceived.
The task design and performance make it hard to assess how much it was truly measuring "trust" in an AI agent's feedback. The AI-feedback is yoked to the participants' performance, agreeing on 75% of trials and disagreeing on 25% (randomly), which is an important difference from the framing provided of human-AI partnerships, where AI-agents usually act independently from the humans and thus disagreements offer information about the human's own performance. In this task, disagreements are uninformative, and coupled with the at-chance performance on an ambiguous task, it is not clear how participants should be interpreting disagreements, and whether they treat it like receiving feedback about the accuracy of their choices, or whether they realise it is uninformative. Much greater discussion and justification are needed about the behaviour in the task, how participants did/should treat the feedback, and how these affect the trust/reliability ratings, as these are all central to the claims of the paper.
There are a lot of EEG results presented here, including whole-brain and window-free analyses, so greater clarity on which results were a priori hypothesised should be given, along with details on how electrodes were selected for ERPs and follow-up tests.
Reviewer #3 (Public review):
The current paper investigates neural correlates of trust development in human-AI interaction, looking at EEG signatures locked to the moment that AI advice is presented. The key finding is that both human-response-locked EEG signatures (the CPP) and post-AI-advice signatures (N2, P3) are modulated by trust ratings. The study is interesting; however, it does have some clear and sometimes problematic weaknesses:
(1) The authors did not include "AI-advice". Instead, a manikin turned green or blue, which was framed as AI advice. It is unclear whether participants viewed this as actual AI advice.
(2) The authors did not include a "non-AI" control condition in their experiment, such that we cannot know how specific all of these effects are to AI, or just generic uncertain feedback processing.
(3) Participants perform the task at chance level. This makes it unclear to what extent they even tried to perform the task or just randomly pressed buttons. These situations likely differ substantially from a real-life scenario where humans perform an actual task (which is not impossible) and receive actual AI advice.
(4) Many of the conclusions in the paper are overstated or very generic.
Author response:
A major point all three reviewers raise is that the ‘human-AI collaboration’ in our experiment may not be true collaboration (as the AI does not classify images per se), but that it is only implied. The reviewers pointed out that whether participants were genuinely engaged in our experimental task is currently not sufficiently addressed. We plan to address this issue in the revised manuscript by including results from a brief interview we conducted after the experiment with each participant, which asked about the participant’s experience and decision-making processes while performing the task. We also measured the participants’ propensity to trust in AI via a questionnaire before and after the experiment. The questionnaire and interview results will allow us to more accurately describe the involvement of our participants in the task. We will also conduct further analyses of the behavioural data (e.g., response times) to show that participants genuinely completed the experimental task. Finally, we will work to sharpen our language and conclusions in the revised manuscript, following the reviewers’ recommendations.
Reviewer #1:
Summary:
In the study by Roeder and colleagues, the authors aim to identify the psychophysiological markers of trust during the evaluation of matching or mismatching AI decision-making. Specifically, they aim to characterize through brain activity how the decision made by an AI can be monitored throughout time in a two-step decision-making task. The objective of this study is to unfold, through continuous brain activity recording, the general information processing sequence while interacting with an artificial agent, and how internal as well as external information interact and modify this processing. Additionally, the authors provide a subset of factors affecting this information processing for both decisions.
Strengths:
The study addresses a wide and important topic of the value attributed to AI decisions and their impact on our own confidence in decision-making. It especially questions some of the factors modulating the dynamical adaptation of trust in AI decisions. Factors such as perceived reliability, type of image, mismatch, or participants' bias toward one response or the other are very relevant to the question in human-AI interactions.
Interestingly, the authors also question the processing of more ambiguous stimuli, with no real ground truth. This gets closer to everyday life situations where people have to make decisions in uncertain environments. Having a better understanding of how those decisions are made is very relevant in many domains.
Also, the method for processing behavioural and especially EEG data is overall very robust and is what is currently recommended for statistical analyses for group studies. Additionally, authors provide complete figures with all robustness evaluation information. The results and statistics are very detailed. This promotes confidence, but also replicability of results.
An additional interesting method aspect is that it is addressing a large window of analysis and the interaction between three timeframes (evidence accumulation pre-decision, decision-making, post-AI decision processing) within the same trials. This type of analysis is quite innovative in the sense that it is not yet a standard in complex experimental designs. It moves forward from classical short-time windows and baseline ERP analysis.
We appreciate the constructive appraisal of our work.
Weaknesses:
R1.1. This manuscript raises several conceptual and theoretical considerations that are not necessarily answered by the methods (especially the task) used. Even though the authors propose to assess trust dynamics and violations in cooperative human-AI teaming decision-making, I don't believe their task resolves such a question. Indeed, there is no direct link between the human decision and the AI decision. They do not cooperate per se, and the AI decision doesn't seem, from what I understood, to have an impact on the participants' decision making. The authors make several assumptions regarding trust, feedback, response expectation, and "classification" (i.e., match vs. mismatch) which seem far-fetched when considering the scientific literature on these topics.
This issue is raised by the other reviewers as well. The reviewer is correct that the AI does not classify images and that the AI response instead depends on the participants’ choice (agreeing in 75% of trials and disagreeing in 25% of trials). Importantly, though, participants were briefed before and during the experiment that the AI is doing its own independent image classification and that human input is needed to assess how well the AI image classification works. That is, participants were led to believe in a genuine, independent AI image classifier in this experiment.
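As an illustration of the yoked feedback schedule described above, the following minimal Python sketch simulates trials in which the AI response depends only on the participant's choice. It is not the experiment code: the function names, the number of trials, the independent per-trial draw of agreement, and the chance-level accuracy of 0.5 are all assumptions made for illustration. Under these assumptions, the participant's accuracy should be roughly the same on match and mismatch trials, which is the sense in which a disagreement carries no information about correctness.

```python
import random


def flip(label):
    """Return the opposite class label."""
    return "fake" if label == "real" else "real"


def simulate_yoked_feedback(n_trials=10_000, p_agree=0.75, p_correct=0.5, seed=0):
    """Simulate trials in which AI feedback is yoked to the participant's response.

    All parameters are illustrative assumptions, not values taken from the
    experiment code (apart from the 75%/25% agreement split stated in the text).
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        truth = rng.choice(["real", "fake"])
        # Assumed chance-level participant: correct with probability p_correct.
        correct = rng.random() < p_correct
        response = truth if correct else flip(truth)
        # Yoked feedback: agreement is drawn independently of the true label.
        match = rng.random() < p_agree
        ai_response = response if match else flip(response)
        trials.append(
            {"truth": truth, "response": response, "ai": ai_response,
             "match": match, "correct": correct}
        )
    return trials


def accuracy_given(trials, match):
    """Participant accuracy restricted to match (True) or mismatch (False) trials."""
    subset = [t for t in trials if t["match"] == match]
    return sum(t["correct"] for t in subset) / len(subset)


trials = simulate_yoked_feedback()
match_rate = sum(t["match"] for t in trials) / len(trials)
acc_match = accuracy_given(trials, True)
acc_mismatch = accuracy_given(trials, False)
print(f"match rate:          {match_rate:.3f}")
print(f"accuracy | match:    {acc_match:.3f}")
print(f"accuracy | mismatch: {acc_mismatch:.3f}")
```

If the actual experiment fixed the number of mismatch trials per block rather than drawing agreement independently on each trial, the match rate would be exact instead of approximate, but the independence from the true label would be unchanged.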
Moreover, the images we presented in the experiment were taken from previous work by Nightingale & Farid (2022). This image dataset includes ‘fake’ (AI generated) images that are indistinguishable from real images.
What matters most for our work is that the participants were truly engaging in the experimental task; that is, they were genuinely judging face images, and they were genuinely evaluating the AI feedback. There is strong indication that this was indeed the case. We conducted and recorded brief interviews after the experiment, asking our participants about their experience and decision-making processes. The questions are as follows:
(1) How did you make the judgements about the images?
(2) How confident were you about your judgement?
(3) What did you feel when you saw the AI response?
(4) Did that change during the trials?
(5) Who do you think was correct?
(6) Did you feel surprised at any of the AI responses?
(7) How did you judge what to put for the reliability sliders?
In our revised manuscript, we will conduct additional analyses to provide detail on participants’ engagement in the task, both in judging the face images and in considering the AI feedback. In addition, we will investigate the EEG signal and response time to check for effects that carry over between trials. We will also frame our findings more carefully, taking the scientific literature into account.
Nightingale SJ, and Farid H. "AI-synthesized faces are indistinguishable from real faces and more trustworthy." Proceedings of the National Academy of Sciences 119.8 (2022): e2120481119.
R1.2. Unlike what is done for the data processing, the authors have not managed to take the big picture of the theoretical implications of their results. A big part of this study's interpretation aims to have their results fit into the theoretical box of the neural markers of performance monitoring.
We indeed used primarily the theoretical box of performance monitoring and predictive coding, since the make-up of our task is similar to a more classical EEG oddball paradigm. In our revised manuscript, we will re-frame and address the link of our findings with the theoretical framework of evidence accumulation and decision confidence.
R1.3. Overall, the analysis method was very robust and well-managed, but the experimental task they have set up does not support their claim. Here, they seem to be assessing the impact of a mismatch between two independent decisions.
Although the human and AI decisions are independent in the current experiment, the EEG results still shed light on the participant’s neural processes, as long as the participant considers the AI’s decision and believes it to be genuine. An experiment in which both decisions carry effective consequences for the task and the human-AI cooperation would be an interesting follow-up study.
Nevertheless, this type of work is very important to various communities. First, it addresses topical concerns associated with the introduction of AI in our daily life and decisions, but it also addresses methodological difficulties that the EEG community has been having to move slowly away from the static event-based short-timeframe analyses onto a more dynamic evaluation of the unfolding of cognitive processes and their interactions. The topic of trust toward AI in cooperative decision making has also been raised by many communities, and understanding the dynamics of trust, as well as the factors modulating it, is of concern to many high-risk environments, or even everyday life contexts. Policy makers are especially interested in this kind of research output.
Reviewer #2:
Summary:
The authors investigated how "AI-agent" feedback is perceived in an ambiguous classification task, and categorised the neural responses to this. They asked participants to classify real or fake faces, and presented an AI-agent's feedback afterwards, where the AI-feedback disagreed with the participants' response on a random 25% of trials (called mismatches). Pre-response ERP was sensitive to participants' classification as real or fake, while ERPs after the AI-feedback were sensitive to AI-mismatches, with stronger N2 and P3a&b components. There was an interaction of these effects, with mismatches after a "Fake" response affecting the N2 and those after "Real" responses affecting P3a&b. The ERPs were also sensitive to the participants' response biases, and their subjective ratings of the AI agent's reliability.
Strengths:
The researchers address an interesting question, and extend the AI-feedback paradigm to ambiguous tasks without veridical feedback, which is closer to many real-world tasks. The in-depth analysis of ERPs provides a detailed categorisation of several ERPs, as well as whole-brain responses, to AI-feedback, and how this interacts with internal beliefs, response biases, and trust in the AI-agent.
We thank the reviewer for their time in reading and reviewing our manuscript.
Weaknesses:
R2.1. There is little discussion of how the poor performance (close to 50% chance) may have affected performance on the task, such as by leading to entirely random guessing or overreliance on response biases. This can change how error-monitoring signals are presented, as they are affected by participants' accuracy, as well as affecting how the AI feedback is perceived.
The images were chosen from a previous study (Nightingale & Farid, 2022, PNAS) that looked specifically at performance accuracy and also found levels around 50%. Hence, ‘fake’ and ‘real’ images are indistinguishable in this image dataset. Our findings agree with the original study.
Judging from the brief interviews after the experiment (see answer to R1.1), all participants were actively and genuinely engaged in the task; hence, it is unlikely that they pressed buttons at random. As mentioned above, we will include a formal analysis of the interviews in the revised manuscript.
The response bias might indeed play a role in how participants responded, and this might be related to their initial propensity to trust in AI. We have questionnaire data available that might shed light on this issue: before and after the experiment, all participants answered the following questions with a 5-point Likert scale ranging from ‘Not True’ to ‘Completely True’:
(1) Generally, I trust AI.
(2) AI helps me solve many problems.
(3) I think it's a good idea to rely on AI for help.
(4) I don't trust the information I get from AI.
(5) AI is reliable.
(6) I rely on AI.
The propensity to trust questionnaire is adapted from Jessup SA, Schneider TR, Alarcon GM, Ryan TJ, & Capiola A. (2019). The measurement of the propensity to trust automation. International Conference on Human-Computer Interaction.
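To make the questionnaire concrete, the following Python sketch shows one way the six items could be combined into a single propensity-to-trust score. It is illustrative only: the numeric coding of the 5-point scale as 1 to 5, the reverse-scoring of the negatively worded item 4, and the use of the item mean are assumptions, as are the function name and the example ratings, which are invented and are not participant data.

```python
# Assumption: 'Not True' is coded 1 and 'Completely True' is coded 5.
SCALE_MIN, SCALE_MAX = 1, 5
N_ITEMS = 6
# Assumption: item 4 ("I don't trust the information I get from AI")
# is negatively worded and therefore reverse-scored.
REVERSED_ITEMS = {4}


def propensity_score(responses):
    """Mean rating across the six items after reverse-scoring item 4.

    `responses` maps item number (1-6) to a rating between 1 and 5.
    """
    if set(responses) != set(range(1, N_ITEMS + 1)):
        raise ValueError("expected ratings for items 1 to 6")
    total = 0
    for item, rating in responses.items():
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"item {item}: rating {rating} is outside the scale")
        if item in REVERSED_ITEMS:
            # Reverse-score so that higher always means more trust.
            rating = SCALE_MIN + SCALE_MAX - rating
        total += rating
    return total / N_ITEMS


# Hypothetical ratings for one participant before and after the experiment.
pre = {1: 4, 2: 3, 3: 4, 4: 2, 5: 3, 6: 2}
post = {1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 2}

change = propensity_score(post) - propensity_score(pre)
print(f"pre: {propensity_score(pre):.2f}, post: {propensity_score(post):.2f}, "
      f"change: {change:+.2f}")
```

A per-participant change score of this kind is one simple way to relate pre- and post-experiment responses to behaviour in the task; other scoring schemes are equally possible.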
Our initial analyses did not find a strong link between the initial (before the experiment) responses to these questions, and how images were rated during the experiment. We will re-visit this analysis and add the results to the revised manuscript.
Regarding how error-monitoring (or the equivalent thereof in our experiment) is perceived, we will analyse interview questions 3 (“What did you feel when you saw the AI response”) and 6 (“Did you feel surprised at any of the AI responses”) and add results to the revised manuscript.
The task design and performance make it hard to assess how much it was truly measuring "trust" in an AI agent's feedback. The AI-feedback is yoked to the participants' performance, agreeing on 75% of trials and disagreeing on 25% (randomly), which is an important difference from the framing provided of human-AI partnerships, where AI-agents usually act independently from the humans and thus disagreements offer information about the human's own performance. In this task, disagreements are uninformative, and coupled with the at-chance performance on an ambiguous task, it is not clear how participants should be interpreting disagreements, and whether they treat it like receiving feedback about the accuracy of their choices, or whether they realise it is uninformative. Much greater discussion and justification are needed about the behaviour in the task, how participants did/should treat the feedback, and how these affect the trust/reliability ratings, as these are all central to the claims of the paper.
In our experiment, the AI disagreements are indeed uninformative for the purpose of making a correct judgment (that is, correctly classifying images as real or fake). However, given that the AI-generated faces are so realistic and indistinguishable from the real faces, the correctness of the judgement is not the main experimental factor in this study. We argue that, provided participants were genuinely engaged in the task, their judgment accuracy is less important than their internal experience when the goal is to examine processes occurring within the participants themselves. We briefed our participants as follows before the experiment:
“Technology can now create hyper-realistic images of people that do not exist. We are interested in your view on how well our AI system performs at identifying whether images of people’s faces are real or fake (computer-generated). Human input is needed to determine when a face looks real or fake. You will be asked to rate images as real or fake. The AI system will also independently rate the images. You will rate how reliable the AI is several times throughout the experiment.”
We plan to expand more fully on the behavioural aspect and our participants’ experience in the revised manuscript by reporting the brief post-experiment interview (R1.1), the propensity to trust questionnaire (R2.1), and additional analyses of the response times.
There are a lot of EEG results presented here, including whole-brain and window-free analyses, so greater clarity on which results were a priori hypothesised should be given, along with details on how electrodes were selected for ERPs and follow-up tests.
We selected the electrodes primarily to maintain consistency across our findings and figures, and focused on central electrodes (Pz and Fz), provided they fell within the reported cluster. In the revised manuscript, we will also report the electrodes showing the maximal statistical effects to give a more complete and descriptive overview. Additionally, we will report where we expected specific ERP components to appear. In brief, we expected to see a P3 component post AI feedback, and a pre-response signal corresponding to the CPP. Beyond these expectations, the remaining analyses were more exploratory. Although we tentatively expected bias to relate to the CPP and reliability ratings to the P3, our results showed the opposite pattern. We will clarify this in the revised version of the manuscript.
Reviewer #3:
The current paper investigates neural correlates of trust development in human-AI interaction, looking at EEG signatures locked to the moment that AI advice is presented. The key finding is that both human-response-locked EEG signatures (the CPP) and post-AI-advice signatures (N2, P3) are modulated by trust ratings. The study is interesting; however, it does have some clear and sometimes problematic weaknesses:
(1) The authors did not include "AI-advice". Instead, a manikin turned green or blue, which was framed as AI advice. It is unclear whether participants viewed this as actual AI advice.
This point has been raised by the other reviewers as well, and we refer to the answers under R1.1 and R2.1. We will address this concern by analysing the post-experiment interviews. In particular, questions 3 (“What did you feel when you saw the AI response?”), 4 (“Did that change during the trials?”) and 6 (“Did you feel surprised at any of the AI responses?”) will give critical insight. As stated above, our general impression from conducting the interviews is that all participants considered the robot icon as a decision from an independent AI agent.
(2) The authors did not include a "non-AI" control condition in their experiment, such that we cannot know how specific all of these effects are to AI, or just generic uncertain feedback processing.
In the conceptualization phase of this study, we indeed considered different control conditions for our experiment to contrast different kinds of feedback. However, previous EEG studies on performance monitoring ERPs have reported similar results for human and machine supervision (Somon et al., 2019; de Visser et al., 2018). We therefore decided to focus on one aspect (the judgement of observation of an AI classification), also to prevent the experiment from taking too long and risking that participants would lose concentration and motivation to complete the experiment. Comparing AI vs non-AI feedback is still interesting and would be a valuable follow-up study.
Somon B, et al. "Human or not human? Performance monitoring ERPs during human agent and machine supervision." NeuroImage 186 (2019): 266-277.
De Visser EJ, et al. "Learning from the slips of others: Neural correlates of trust in automated agents." Frontiers in human neuroscience 12 (2018): 309.
(3) Participants perform the task at chance level. This makes it unclear to what extent they even tried to perform the task or just randomly pressed buttons. These situations likely differ substantially from a real-life scenario where humans perform an actual task (which is not impossible) and receive actual AI advice.
This concern was also raised by the other two reviewers. As already stated in our responses above, we will add results from the post-experiment interviews with the participants, the propensity to trust questionnaire, and additional behavioural analyses in our revised manuscript.
Reviewer 1 (R1.3) also brought up the situation where decisions by the participant and the AI have a more direct link which carries consequences. This will be valuable follow-up research. In the revised manuscript, we will more carefully frame our approach.
(4) Many of the conclusions in the paper are overstated or very generic.
In the revised manuscript, we will re-phrase our discussion and conclusions to address the points raised in the reviewer’s recommendations to authors.
eLife Assessment
This important study provides convincing evidence that envelope-carrying Ty3/gypsy retrotransposons (errantiviruses) are ancient, widespread, and actively expanding across nearly all major animal phyla. Using comprehensive phylogenetic and AlphaFold2-based structural analyses, the authors show that these elements independently acquired membrane fusion proteins early in metazoan evolution, likely predating the bilaterian-non-bilaterian split. While some aspects could be more clearly contextualized and explained better, the work offers insights into the deep evolutionary roots of retroelement-envelope associations and the origins of retroviruses.
Reviewer #1 (Public review):
Summary:
This manuscript provides a comprehensive systematic analysis of envelope-containing Ty3/gypsy retrotransposons (errantiviruses) across metazoan genomes, including both invertebrates and ancient animal lineages. Using iterative tBLASTn mining of over 1,900 genomes, the authors catalog 1,512 intact retrotransposons with uninterrupted gag, pol, and env open reading frames. They show that these elements are widespread, being present in most metazoan phyla (including cnidarians, ctenophores, and tunicates), with active proliferation indicated by their multicopy status. Phylogenetic analyses distinguish "ancient" and "insect" errantivirus clades, while structural characterization (including AlphaFold2 modeling) reveals two major env types: paramyxovirus F-like and herpesvirus gB-like proteins. Although both envelope types were identified in previous analyses two decades ago, the understanding of the evolutionary provenance of these envelope genes remained rudimentary and largely anecdotal (I can say this because I authored one of these studies). The results in the present study support an ancient origin for env acquisition in metazoan Ty3/gypsy elements, with subsequent vertical inheritance and limited recombination between env and pol domains. The paper also proposes an expanded definition of 'errantivirus' for env-carrying Ty3/gypsy elements outside Drosophila.
Strengths:
(1) Comprehensive Genomic Survey: The breadth of the genome search across non-model metazoan phyla yields an impressive dataset covering evolutionary breadth, with clear documentation of search iterations and validation criteria for intact elements.
(2) Robust Phylogenetic Inference: The use of maximum likelihood trees on both pol and env domains, with thorough congruence analysis, convincingly separates ancient from lineage-specific elements and demonstrates co-evolution of env and pol within clades.
(3) Structural Insights: AlphaFold2-based predictions provide high-confidence structural evidence that both env types have retained fusion-competent architectures, supporting the hypothesis of preserved functional potential.
(4) Novelty and Scope: The study challenges previous assumptions of insect-centric or recent env acquisition and makes a compelling case for a Pre-Cambrian origin, significantly advancing our understanding of animal retroelement diversity and evolution. This is a major advance.
(5) Data Transparency: I appreciate that all data, code, and predicted structures are made openly available, facilitating reproducibility and future comparative analyses.
Major Weaknesses:
(1) Functional Evidence Gaps: The work rests largely on sequence and structure prediction. No direct expression or experimental validation of envelope gene function or infectivity outside Drosophila is attempted, which would be valuable to corroborate the inferred roles of these glycoproteins in non-insect lineages. At least for some of these species, there are RNA-seq datasets that could be leveraged.
(2) Horizontal Transfer vs. Loss Hypotheses: The discussion argues primarily for vertical inheritance, but the somewhat sporadic phylogenetic distributions and long-branch effects suggest that loss and possibly rare horizontal events may contribute more than acknowledged. Explicit quantitative tests for horizontal transfer, or reconciliation analyses, would strengthen this conclusion. It's also worth pointing out that, unlike retrotransposons that can be found in genomes, any potential related viral envelopes must, by definition, have a spottier distribution due to sampling. I don't think this challenges any of the conclusions, but it must be acknowledged as something that could affect the strength of this conclusion.
(3) Limited Taxon Sampling for Certain Phyla: Despite the impressive breadth, some ancient lineages (e.g., Porifera, Echinodermata) are negative, but the manuscript does not fully explore whether this reflects real biological absence, assembly quality, or insufficient sampling. A more systematic treatment of negative findings would clarify claims of ubiquity. However, I also believe this falls beyond the scope of this study.
(4) Mechanistic Ambiguity: The proposed model that env-containing elements exploit ovarian somatic niches is plausible but extrapolated from Drosophila data; for most taxa, actual tissue specificity, lifecycle, or host interaction mechanisms remain speculative and, to me, a bit unreasonable.
Minor Weaknesses:
(1) Terminology and Nomenclature: The paper introduces and then generalizes the term "errantivirus" to non-insect elements. While this is logical, it may confuse readers familiar with the established, Drosophila-centric definition if not more explicitly clarified throughout. I also worry about changes being made without any input from the ICTV nomenclature committee, which just went through a thorough reclassification. Nevertheless, change is expected, and calling them all errantiviruses is entirely reasonable.
(2) Figures and Supplementary Data Navigation: Some key phylogenies and domain alignments are found only in supplementary figures, occasionally hindering readability for non-expert audiences. Selected main-text inclusion of representative trees would benefit accessibility.
(3) ORF Integrity Thresholds: The cutoff choices for defining "intact" elements (e.g., numbers/placement of stop codons, length ranges) are reasonable but only lightly justified. More rationale or sensitivity analysis would improve confidence in the inclusion criteria. For example, how did changing these criteria change the number of intact elements?
(4) Minor Typos/Formatting: The paper contains sporadic typographical errors and formatting glitches (e.g., misaligned figure labels, unrendered symbols) that should be addressed.
Reviewer #2 (Public review):
Summary:
The authors first surveyed metazoan genomes to identify homologs of Drosophila errantiviruses and classified them into two groups, "insect" and "ancient" elements, supporting the hypothesis of an early evolutionary origin for these retrotransposons. They subsequently identified two distinct types of envelope proteins, one resembling the glycoprotein F of paramyxoviruses and the other akin to the glycoprotein B of herpesviruses. Despite differences in their primary amino acid sequences, these proteins display notable structural similarity in their predicted domain architectures. The congruence between the phylogenies of the envelope and pol genes further supports the ancient origin of the envelope genes, challenging earlier hypotheses that proposed recent recombination events with baculoviruses. Additional analysis of the Pol "bridge region" corroborated the divergence among these elements, consistent with a pattern of limited cross-species recombination. Finally, by comparing these elements with non-envelope-containing Gypsy retrotransposons, the authors concluded that errantiviruses originated from multiple elements independently.
Strengths:
The conclusions of this study are based on a comprehensive collection of errantiviruses identified across a wide range of metazoan genomes. These findings are further supported by multiple lines of evidence, including phylogenetic congruence and the diverse evolutionary origins of envelope genes. AlphaFold2-assisted protein domain structure analyses also provided key insights into the characterization of these elements. Together, these results present a compelling case that errantiviruses arose independently through multiple evolutionary events, extending well beyond previous hypotheses.
Weaknesses:
It would be beneficial to emphasize in the Abstract the potential impact of this work by more clearly articulating the current knowledge gap in the field. While the second paragraph of the Introduction briefly touches on this point, highlighting the broader significance in the Abstract would better capture readers' interest. Additionally, some methodological choices would benefit from clearer justification and explanation. For instance, in Figure 6, the selection of the bridge region/RNase H domain is not explicitly explained, leaving the rationale for its choice unclear. As a minor point, some figure labels and texts are too small and difficult to read, and improving their legibility would enhance overall clarity.
Reviewer #3 (Public review):
Summary and Significance:
In this work, Cary and Hayashi address the important question of when, in evolution, certain mobile genetic elements (Ty3/gypsy-like non-LTR retrotransposons) associated with certain membrane fusion proteins (viral glycoprotein F or B-like proteins), which could allow these mobile genetic elements to be transferred between individual cells of a given host. It is debated in the literature whether the acquisition of membrane fusion proteins by non-LTR retrotransposons is a rather recent phenomenon that separately occurred in the ancestors of certain host species or whether the association with membrane fusion proteins is a much more ancient one, pre-dating the Cambrian explosion. Obviously, this question also touches upon the origin of the retroviruses, which can spread between individuals of a given host but seem restricted to vertebrates. Based on convincing data, Cary and Hayashi argue that an ancient association of non-LTR retrotransposons with membrane fusion proteins is most probable.
Strengths:
The authors take the smart approach to systematically retrieve apparently complete, intact, and recently functional Ty3/gypsy-like non-LTR retrotransposons that, next to their characteristic gag and pol genes, additionally carry sequences that are homologous to viral glycoprotein F (env-F) or viral glycoprotein B (env-B). They then construct and compare phylogenetic trees of the host species and individual encoded proteins and protein domains, where 3D-structure calculations and other features explain and corroborate the clustering within the phylogenetic trees. Congruence of phylogenetic trees and correlation of structural features is then taken as evidence for an infrequent recombination and a long-term co-evolution of the reverse transcriptase (encoded by the pol gene) and its respective putative membrane fusion gene (encoded by env-F or env-B). Importantly, the env-F and env-B containing retrotransposons do not form a monophyletic group among the Ty3/gypsy-like non-LTR retrotransposons, but are scattered throughout, supporting the idea of an originally ancient association followed by a random loss of env-F/env-B in individual branches of the tree (and rather rare re-associations via more recent recombinations).
Overall, this is valuable, stimulating, and important work of general and fundamental interest, but still also somewhat incompletely explored, imprecisely explained, and insufficiently put into context for a more general audience.
Weaknesses:
Some points that might be considered and clarified:
(1) Imprecise explanations, terms, and definitions:
It might help to add a 'definitions box' or similar to precisely explain how the authors decided to use certain terms in this manuscript, and then use these terms consistently and with precision.
a) In particular, these are terms such as 'vertebrate retrovirus' vs 'retrovirus' vs 'endogenized retrovirus' vs 'endogenous retrovirus' vs 'non-LTR retrotransposon' and 'Ty3/gypsy-like retrotransposon' vs 'Ty3/gypsy retrotransposon' vs 'errantivirus'.
b) The comment also applies to the term 'env' used for both 'env-F' and 'env-B', where often it remains unclear which of the two protein types the authors refer to. This is confusing, particularly in the methods, where the search for the respective homologs is described.
c) Other examples are the use of the entire pol gene vs. pol-RT for the definition of the Ty3/gypsy clade and for the generation of phylogenetic trees (Methods and Figure S1), and the names for various portions of pol that appear without prior definition or explanation (e.g., 'pro' in Figure 1A, 'bridge' in Figure S1C, 'the chromodomain' in the text and Figure 7).
d) It is unclear from the main text which portions of pol were chosen to define pol-RT and why. The methods name the 'palm-and-fingers', 'thumb', and 'connections' domains to define RT. In the main text, the 'connection' domain is called 'tether' and is instead defined as part of the 'bridge' region following RT, which is not part of RT.
(2) Insufficient broader context:
a) The introduction does not state what defines Ty3/gypsy non-LTR retrotransposons as compared to their closest relatives (Ty1/copia retrotransposons, BEL/pao retrotransposons, vertebrate retroviruses). This makes it difficult to judge the significance and generality of the findings.
b) The various known compositions of Ty3/gypsy-like retrotransposons are not mentioned and explained in the introduction (open reading frames, (poly-)proteins and protein domains, and their variable arrangement, enzymatic activities, and putative functions), and the distribution of Ty3/gypsy-like retrotransposons among eukaryotes remains unclear. The introduction does not mention that Ty3/gypsy-like retrotransposons apparently are absent from vertebrates, and Figure 7 is not very clear about whether or not it includes sequences from plants ('Chromoviridae').
c) The known association of Ty3/gypsy-like retrotransposons from different metazoan phyla with putative membrane fusion protein (env-like) genes is mentioned in the introduction, but literature information on whether such associations also occur in the context of other retrotransposons (e.g., Ty1/copia or BEL/pao) is not provided. The abstract is somewhat misleading in this respect. Finally, the different known types of env-like genes are not mentioned and explained as part of the introduction ('env-F', 'env-B', 'retroviral env', others?).
d) Some key references and reviews might be added:
- Pelisson, A. et al. (1994) https://www.embopress.org/doi/abs/10.1002/j.1460-2075.1994.tb06760.x
(next to Song et al. (1994), for the identification of env in Ty3/gypsy)
- Boeke, J.D. et al. (1999)
In Virus Taxonomy: ICTV VIIth report (ed. F.A. Murphy). Springer-Verlag, New York.
(cited by Malik et al. (2000) - for the definition and first use of the term 'errantivirus')
- Eickbush, T.H. and Jamburuthugoda, V.K. (2008) https://doi.org/10.1016/j.virusres.2007.12.010
(on the classification of retrotransposons and their env-like genes)
- Hayward, A. (2017) https://doi.org/10.1016/j.coviro.2017.06.006
(on scenarios of env acquisition)
(3) Incomplete analysis:
a) Mobile genetic elements are sometimes difficult to assemble correctly from short-read sequencing data. Did the authors confirm some of their newly identified elements by e.g., PCR analysis or re-identification in long-read sequencing data?
b) The authors mention somewhat on the side that there are Ty3/gypsy elements with a different arrangement (gag-env-pol instead of gag-pol-env). Why was this important feature apparently not used and correlated in the analysis? How does it map on the RT phylogenetic tree? Which type of env is found with either arrangement? Is there evidence for a loss of env also in the case of gag-env-pol elements?
c) Sankey plots are insufficiently explained. How would inconsistencies between trees (recombinations) show up here? Why is there no Sankey plot for the analysis of env-B in Figure 5?
d) Why are there no trees generated for env-F and env-B like proteins, including closely related homologous sequences that do NOT come from Ty3/gypsy retrotransposons (e.g., from the eukaryotic hosts, from other types of retrotransposons (Ty1/copia or BEL/pao), from viruses such as Herpesvirus and Baculovirus)? It would be informative whether the sequences from Ty3/gypsy cluster together in this case.
e) Did the authors identify any other env-like ORFs (apart from env-F and env-B) among Ty3/gypsy retrotransposons? Did they identify other, non-env-like ORFs that might help in the analysis? It is not quite clear from the methods if the searches for env-F- and env-B-containing Ty3/gypsy elements were done separately and consecutively or somehow combined (the authors generally use 'env', and it is not clear which type of protein this refers to).
f) Why was the gag protein apparently not used to support the analysis? Are there different, unrelated types of gag among non-LTR retrotransposons? Does gag follow or break the pattern of co-evolution between RT and env-F/env-B?
g) Data availability. The link given in the paper does not seem to work (https://github.com/RippeiHayashi/errantiviruses_2025/tree/main). It would be useful for the community to have the sequences of the newly identified Ty3/gypsy retrotransposons listed readily available (not just genome coordinates as in table S1), together with the respective annotations of ORFs and features.
Author response:
We appreciate the thorough and highly valuable feedback from the reviewers. We will take their suggestions on board and prepare a revised manuscript focusing on the following points:
(1) As reviewers pointed out, we did not evaluate horizontal transfer events of env-containing Ty3/gypsy elements. We consistently observed that elements found in the same phylum/class/superfamily cluster together in the POL phylogenetic tree, suggesting an ancient acquisition of env by the Ty3/gypsy elements; the separation would not be as clear as we observed had they been frequently gained from animals across different phyla/classes/superfamilies. However, this does not exclude more recent horizontal transfer events that may occur between closely related species. We will perform gene-tree species-tree reconciliation analyses in clades that have enough elements and represented species to estimate the frequency of horizontal transfer events.
(2) We did not find env-containing Ty3/gypsy elements in some animal phyla such as Echinodermata and Porifera, but this could be due to the quality or number of available genome assemblies, as reviewers suggested. To address this, we will mine GAG-POL gypsy elements in the genomes that were devoid of GAG-POL-ENV elements and compare their abundance with other genomes that carry GAG-POL-ENV elements. If GAG-POL gypsy elements are found at similar abundance, that would indicate that the observed absence of GAG-POL-ENV elements is not due to poor quality of genome assemblies.
(3) We will include F-type and HSV-gB type ENV proteins from known viruses in the phylogenetic analysis to investigate their ancestry and potential recombination events with env-containing Ty3/gypsy elements.
(4) Wherever relevant, we will clarify the terms used in the manuscript, provide a rationale for our selection of POL domains used for structural and phylogenetic analyses, improve accessibility of figures, touch on gypsy elements in vertebrates, and make sure all concepts covered in the results are sufficiently introduced in the introduction.
eLife Assessment
This important study provides convincing evidence that glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, plays a role in the differentiation of intestinal cells. Mutations in GlcT compromise Notch signaling in the Drosophila intestinal stem cell lineage, resulting in the formation of enteroendocrine tumors. Further data suggest that a homolog of glucosylceramide synthase also influences Notch signaling in the mammalian intestine. While the outstanding strengths of the initial genetic and downstream pathway analyses are noted, there are minor weaknesses in the data regarding the potential role of this pathway in Delta trafficking. Nevertheless, this study opens the way for future mechanistic studies addressing how specific lipids modulate Notch signalling activity.
Reviewer #1 (Public review):
Summary:
From a forward genetic mosaic mutant screen using EMS, the authors identify mutations in glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, that result in EE tumors. Multiple genetic experiments strongly support the model that the mutant phenotype caused by GlcT loss is due to failure to convert ceramide into glucosylceramide. Further genetic evidence suggests that Notch signaling is compromised in the ISC lineage and may affect endocytosis of Delta. Loss of GlcT does not affect wing development or oogenesis, suggesting tissue-specific roles for GlcT. Finally, an increase in goblet cells in UGCG knockout mice, not previously reported, suggests a conserved role for GlcT in Notch signaling in intestinal cell lineage specification.
Strengths:
Overall, this is a well-written paper with multiple well-designed and executed genetic experiments that support a role for GlcT in Notch signaling in the fly and mammalian intestine. The authors have addressed my concerns from the prior review.
Reviewer #2 (Public review):
Summary:
This study genetically identifies two key enzymes involved in the biosynthesis of glycosphingolipids, GlcT and Egh, which act as tumor suppressors in the adult fly gut. Detailed genetic analysis indicates that a deficiency in Mactosyl-ceramide (Mac-Cer) is causing tumor formation. Analysis of a Notch transcriptional reporter further indicates that the lack of Mac-Cer is associated with reduced Notch activity in the gut, but not in other tissues.
Addressing how a change in the lipid composition of the membranes might lead to defective Notch receptor activation, the authors studied the endocytic trafficking of Delta and claimed that internalized Delta appeared to accumulate faster into endosomes in the absence of Mac-Cer. Further analysis of Delta steady state accumulation in fixed samples suggested a delay in the endosomal trafficking of Delta from Rab5+ to Rab7+ endosomes, which was interpreted to suggest that the inefficient, or delayed, recycling of Delta might cause a loss in Notch receptor activation.
Finally, the histological analysis of mouse guts following the conditional knock-out of the GlcT gene suggested that Mac-Cer might also be important for proper Notch signaling activity in that context.
Strengths:
The genetic analysis is of high quality. The finding that a Mac-Cer deficiency results in reduced Notch activity in the fly gut is important and fully convincing.
The mouse data, although preliminary, raised the possibility that the role of this specific lipid may be conserved across species.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
Summary:
From a forward genetic mosaic mutant screen using EMS, the authors identify mutations in glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, that result in EE tumors. Multiple genetic experiments strongly support the model that the mutant phenotype caused by GlcT loss is due to failure to convert ceramide into glucosylceramide. Further genetic evidence suggests that Notch signaling is compromised in the ISC lineage and may affect the endocytosis of Delta. Loss of GlcT does not affect wing development or oogenesis, suggesting tissue-specific roles for GlcT. Finally, an increase in goblet cells in UGCG knockout mice, not previously reported, suggests a conserved role for GlcT in Notch signaling in intestinal cell lineage specification.
Strengths:
Overall, this is a well-written paper with multiple well-designed and executed genetic experiments that support a role for GlcT in Notch signaling in the fly and mammalian intestine. I do, however, have a few comments below.
Weaknesses:
(1) The authors bring up the intriguing idea that GlcT could be a way to link diet to cell fate choice. Unfortunately, there are no experiments to test this hypothesis.
We indeed attempted to establish an assay to investigate the impact of various diets (such as high-fat, high-sugar, or high-protein diets) on the fate choice of ISCs. Subsequently, we intended to examine the potential involvement of GlcT in this process. However, we observed that the number or percentage of EEs varies significantly among individuals, even among flies with identical phenotypes subjected to the same nutritional regimen. We suspect that the proliferative status of ISCs and the turnover rate of EEs may significantly influence the number of EEs present in the intestinal epithelium, complicating the interpretation of our results. Consequently, we are unable to conduct this experiment at this time. The hypothesis suggesting that GlcT may link diet to cell fate choice remains an avenue for future experimental exploration.
(2) Why do the authors think that UGCG knockout results in goblet cell excess and not in the other secretory cell types?
This is indeed an interesting point. In the mouse intestine, it is well-documented that the knockout of Notch receptors or Delta-like ligands results in a classic phenotype characterized by goblet cell hyperplasia, with little impact on the other secretory cell types. This finding aligns very well with our experimental results, as we noted that the numbers of Paneth cells and enteroendocrine cells appear to be largely normal in UGCG knockout mice. By contrast, increases in other secretory cell types are typically observed under conditions of pharmacological inhibition of the Notch pathway.
(3) The authors should cite other EMS mutagenesis screens done in the fly intestine.
To our knowledge, the EMS screen on chromosome 2L conducted in Allison Bardin’s lab is the only one prior to this work, and it led to two publications (Perdigoto et al., 2011; Gervais et al., 2019). We have now included citations for both papers in the revised manuscript.
(4) The absence of a phenotype using NRE-Gal4 is not convincing. This is because the delay in its expression could be after the requirement for the affected gene in the process being studied. In other words, sufficient knockdown of GlcT by RNAi would not be achieved until after the relevant signaling between the EB and the ISC occurred. Dl-Gal4 is problematic as an ISC driver because Dl is expressed in the EEP.
This is an excellent point, and we agree that the lack of an observable phenotype using NRE-Gal4 could be due to delayed expression, which may result in missing the critical window required for effective GlcT knockdown. Consequently, we cannot rule out the possibility that GlcT also plays a role in early EBs or EEPs. We have revised the manuscript to soften this conclusion and to include this alternative explanation for the experiment.
(5) The difference in Rab5 between control and GlcT-IR was not that significant. Furthermore, any changes could be secondary to increases in proliferation.
We agree that it is possible that the observed increase in proliferation could influence the number of Rab5+ endosomes, and we will temper our conclusions on this aspect accordingly. However, it is important to note that, although the difference in Rab5+ endosomes between the control and GlcT-IR conditions appeared mild, it was statistically significant and reproducible. In our revised experiments, we have not only added statistical data and immunofluorescence images for Rab11 but also unified the approaches used for detecting Rab-associated proteins (in the previous figures, Rab5 was shown using U-Rab5-GFP, whereas Rab7 was detected by direct antibody staining). Based on this unified strategy, we optimized the quantification of Dl-GFP colocalization with early, late, and recycling endosomes, and the results are consistent with our previous observations (see the updated Fig. 5).
Reviewer #2 (Public review):
Summary:
This study genetically identifies two key enzymes involved in the biosynthesis of glycosphingolipids, GlcT and Egh, which act as tumor suppressors in the adult fly gut. Detailed genetic analysis indicates that a deficiency in Mactosyl-ceramide (Mac-Cer) is causing tumor formation. Analysis of a Notch transcriptional reporter further indicates that the lack of Mac-Cer is associated with reduced Notch activity in the gut, but not in other tissues.
Addressing how a change in the lipid composition of the membranes might lead to defective Notch receptor activation, the authors studied the endocytic trafficking of Delta and claimed that internalized Delta appeared to accumulate faster into endosomes in the absence of Mac-Cer. Further analysis of Delta steady-state accumulation in fixed samples suggested a delay in the endosomal trafficking of Delta from Rab5+ to Rab7+ endosomes, which was interpreted to suggest that the inefficient, or delayed, recycling of Delta might cause a loss in Notch receptor activation.
Finally, the histological analysis of mouse guts following the conditional knock-out of the GlcT gene suggested that Mac-Cer might also be important for proper Notch signaling activity in that context.
Strengths:
The genetic analysis is of high quality. The finding that a Mac-Cer deficiency results in reduced Notch activity in the fly gut is important and fully convincing.
The mouse data, although preliminary, raised the possibility that the role of this specific lipid may be conserved across species.
Weaknesses:
This study is not, however, without caveats and several specific conclusions are not fully convincing.
First, the conclusion that GlcT is specifically required in Intestinal Stem Cells (ISCs) is not fully convincing for technical reasons: NRE-Gal4 may be less active in GlcT mutant cells, and the knock-down of GlcT using Dl-Gal4ts may not be restricted to ISCs given the perdurance of Gal4 and of its downstream RNAi.
As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We have revised our manuscript to present a more cautious conclusion and explicitly described this possibility in the updated version.
Second, the results from the antibody uptake assays are not clear: i) the levels of internalized Delta were not quantified in these experiments; ii) additionally, live guts were incubated with anti-Delta for 3 hr. This long period of incubation indicates that the observed results may not necessarily reflect the dynamics of endocytosis of antibody-bound Delta, but might also inform about the distribution of intracellular Delta following the internalization of unbound anti-Delta. It would thus be interesting to examine the level of internalized Delta in experiments with shorter incubation time.
We thank the reviewer for these excellent questions. In our antibody uptake experiments, we noted that Dl reached its peak accumulation after a 3-hour incubation period. We recognize that quantifying internalized Dl would enhance our analysis, and we will include the corresponding statistical graphs in the revised version of the manuscript. In addition, we agree that during the 3-hour incubation, the potential internalization of unbound anti-Dl cannot be ruled out, as it may influence the observed distribution of intracellular Dl. We therefore attempted to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. However, we found that the GFP expression level of Dl-GFP (either in the knock-in or transgenic line) was too low to be reliably tracked. During the three-hour observation period, the weak GFP signal remained largely unchanged regardless of the GlcT mutation status, and the signal resolution under the microscope was insufficient to clearly distinguish membrane-associated from intracellular Dl. Therefore, we were unable to obtain a dynamic view of Dl trafficking through live imaging. Nevertheless, our Dl antibody uptake and endosomal retention analyses collectively support the notion that MacCer influences Notch signaling by regulating Dl endocytosis.
Overall, the proposed working model needs to be solidified as important questions remain open, including: Is the endo-lysosomal system, i.e. the steady-state distribution of endo-lysosomal markers, affected by the Mac-Cer deficiency? Is the trafficking of Notch also affected by the Mac-Cer deficiency? Is the rate of Delta endocytosis also affected by the Mac-Cer deficiency? Are the levels of cell-surface Delta reduced upon the loss of Mac-Cer?
Regarding the impact on the endo-lysosomal system, this is indeed an important aspect to explore. While we did not conduct experiments specifically designed to evaluate the steady-state distribution of endo-lysosomal markers, our analyses utilizing Rab5-GFP overexpression and Rab7 staining did not indicate any significant differences in endosome distribution in MacCer deficient conditions. Moreover, we still observed high expression of the NRE-LacZ reporter specifically at the boundaries of clones in GlcT mutant cells (Fig. 4A), indicating that GlcT mutant EBs remain responsive to Dl produced by normal ISCs located right at the clone boundary. Therefore, we propose that MacCer deficiency may specifically affect Dl trafficking without impacting Notch trafficking.
In our 3-hour antibody uptake experiments, we observed a notable decrease in cell-surface Dl, which was accompanied by an increase in intracellular accumulation. These findings collectively suggest that Dl may be unstable on the cell surface, leading to its accumulation in early endosomes.
Third, while the mouse results are potentially interesting, they seem to be relatively preliminary, and future studies are needed to test whether the level of Notch receptor activation is reduced in this model.
In the mouse small intestine, Olfm4 is a well-established target gene of the Notch signaling pathway, and its staining provides a reliable indication of Notch pathway activation. While we attempted to evaluate Notch activation using additional markers, such as Hes1 and NICD, we encountered difficulties, as the corresponding antibody reagents did not perform well in our hands. Despite these challenges, we believe that our findings with Olfm4 provide an important starting point for further investigation in the future.
Reviewer #3 (Public review):
Summary:
In this paper, Tang et al report the discovery of a glucosylceramide synthase gene, GlcT, which they found in a genetic screen for mutations that generate tumorous growth of stem cells in the gut of Drosophila. The screen was expertly done using a classic mutagenesis/mosaic method. Their initial characterization of the GlcT alleles, which generate endocrine tumors much like mutations in the Notch signaling pathway, is also very nice. Tang et al checked other enzymes in the glycosylceramide pathway and found that the loss of one gene just downstream of GlcT (Egh) gives similar phenotypes to GlcT, whereas three genes further downstream do not replicate the phenotype. Remarkably, dietary supplementation with a predicted GlcT/Egh product, Lactosyl-ceramide, was able to substantially rescue the GlcT mutant phenotype. Based on the phenotypic similarity of the GlcT and Notch phenotypes, the authors show that activated Notch is epistatic to GlcT mutations, suppressing the endocrine tumor phenotype, and that GlcT mutant clones have reduced Notch signaling activity. Up to this point, the results are all clear, interesting, and significant. Tang et al then go on to investigate how GlcT mutations might affect Notch signaling, and present results suggesting that GlcT mutation might impair the normal endocytic trafficking of Delta, the Notch ligand. These results (Fig X-XX), unfortunately, are less than convincing; either more conclusive data should be brought to support the Delta trafficking model, or the authors should limit their conclusions regarding how GlcT loss impairs Notch signaling. Given the results shown, it's clear that GlcT affects EE cell differentiation, but whether this is via directly altering Dl/N signaling is not so clear, and other mechanisms could be involved. Overall the paper is an interesting, novel study, but it lacks somewhat in providing mechanistic insight. With conscientious revisions, this could be addressed.
We list below specific points that Tang et al should consider as they revise their paper.
Strengths:
The genetic screen is excellent.
The basic characterization of GlcT phenotypes is excellent, as is the downstream pathway analysis.
Weaknesses:
(1) Lines 147-149, Figure 2E: here, the study would benefit from quantitations of the effects of loss of brn, B4GalNAcTA, and a4GT1, even though they appear negative.
We have incorporated the quantifications for the effects of the loss of brn, B4GalNAcTA, and a4GT1 in the updated Figure 2.
(2) In Figure 3, it would be useful to quantify the effects of LacCer on proliferation. The suppression result is very nice, but only effects on Pros+ cell numbers are shown.
We have now added quantifications of the number of EEs per clone to the updated Figure 3.
(3) In Figure 4A/B we see less NRE-LacZ in GlcT mutant clones. Are the data points in Figure 4B per cell or per clone? Please note. Also, there are clearly a few NRE-LacZ+ cells in the mutant clone. How does this happen if GlcT is required for Dl/N signaling?
In Figure 4B, the data points represent the fluorescence intensity per single cell within each clone. It is true that a few NRE-LacZ+ cells can still be observed within the mutant clone; however, this does not contradict our conclusion. As noted, high expression of the NRE-LacZ reporter was specifically observed around the clone boundaries in MacCer deficient cells (Fig. 4A), indicating that the mutant EBs can normally receive Dl signal from the normal ISCs located at the clone boundary and activate the Notch signaling pathway. Therefore, we believe that, although affecting Dl trafficking, MacCer deficiency does not significantly affect Notch trafficking.
(4) Lines 222-225, Figure 5AB: The authors use the NRE-Gal4ts driver to show that GlcT depletion in EBs has no effect. However, this driver is not activated until well into the process of EB commitment, and RNAi's take several days to work, and so the authors' conclusion that GlcT is "specifically required in ISCs" and not at all in EBs may be erroneous.
As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We have revised our manuscript to present a more cautious conclusion and described this possibility in the updated version.
(5) Figure 5C-F: These results relating to Delta endocytosis are not convincing. The data in Fig 5C are not clear and not quantitated, and the data in Figure 5F are so widely scattered that it seems these co-localizations are difficult to measure. The authors should either remove these data, improve them, or soften the conclusions taken from them. Moreover, it is unclear how the experiments tracing Delta internalization (Fig 5C) could actually work. This is because for this method to work, the anti-Dl antibody would have to pass through the visceral muscle before binding Dl on the ISC cell surface. To my knowledge, antibody transcytosis is not a common phenomenon.
We thank the reviewer for these insightful comments and suggestions. In our in vivo experiments, we observed increased co-localization of Rab5 and Dl in GlcT mutant ISCs, indicating that Dl trafficking is delayed at the transition to Rab7⁺ late endosomes, a finding that is further supported by our antibody uptake experiments. We acknowledge that the data presented in Fig. 5C are not fully quantified and that the co-localization data in Fig. 5F may appear somewhat scattered; therefore, we have included additional quantification and enhanced the data presentation in the revised manuscript.
Regarding the concern about antibody internalization, we appreciate this point. We currently do not know if the antibody reaches the cell surface of ISCs by passing through the visceral muscle or via other routes. Given that the experiment was conducted with fragmented gut, it is possible that the antibody may penetrate into the tissue through mechanisms independent of transcytosis.
As mentioned earlier, we attempted to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. However, we found that the GFP expression level of Dl-GFP (either in the knock-in or transgenic line) was too low to be reliably tracked. During the three-hour observation period, the weak GFP signal remained largely unchanged regardless of the GlcT mutation status, and the signal resolution under the microscope was insufficient to clearly distinguish membrane-associated from intracellular Dl. Therefore, we were unable to obtain a dynamic view of Dl trafficking through live imaging. Nevertheless, our Dl antibody uptake and endosomal retention analyses collectively support the notion that MacCer influences Notch signaling by regulating Dl endocytosis.
(6) It is unclear whether MacCer regulates Dl-Notch signaling by modifying Dl directly or by influencing the general endocytic recycling pathway. The authors say they observe increased Dl accumulation in Rab5+ early endosomes but not in Rab7+ late endosomes upon GlcT depletion, suggesting that the recycling endosome pathway, which retrieves Dl back to the cell surface, may be impaired by GlcT loss. To test this, the authors could examine whether recycling endosomes (marked by Rab4 and Rab11) are disrupted in GlcT mutants. Rab11 has been shown to be essential for recycling endosome function in fly ISCs.
We agree that assessing the state of recycling endosomes, especially by using markers such as Rab11, would be valuable in determining whether MacCer regulates Dl-Notch signaling by directly modifying Dl or by influencing the broader endocytic recycling pathway. In the newly added experiments, we found that in GlcT-IR flies, Dl still exhibits partial colocalization with Rab11, and the overall expression pattern of Rab11 is not affected by GlcT knockdown (Fig. 5E-F). These observations suggest that MacCer specifically regulates Dl trafficking rather than broadly affecting the recycling pathway.
(7) It remains unclear whether Dl undergoes post-translational modification by MacCer in the fly gut. At a minimum, the authors should provide biochemical evidence (e.g., Western blot) to determine whether GlcT depletion alters the protein size of Dl.
While we propose that MacCer may function as a component of lipid rafts, facilitating Dl membrane anchorage and endocytosis, we also acknowledge the possibility that MacCer could serve as a substrate for protein modifications of Dl necessary for its proper function. Conducting biochemical analyses to investigate potential post-translational modifications of Dl by MacCer would indeed provide valuable insights. We have performed Western blot analysis to test whether GlcT depletion affects the protein size of Dl. As shown below, we did not detect any apparent changes in the molecular weight of the Dl protein. Therefore, it is unlikely that MacCer regulates post-translational modifications of Dl.
Author response image 1.
Western blot analysis to investigate whether MacCer modifies Dl. (A) Four lanes were loaded: the first two contained 20 μL of membrane extract (lane 1: GlcT-IR, lane 2: control), while the last two contained 10 μL of membrane extract. (B) Full blot images are shown under both long and short exposure conditions.
(8) It is unfortunate that GlcT doesn't affect Notch signaling in other organs of the fly. This calls the Delta trafficking model into question, and the authors should note this. Also, the clonal marker in Figure 6C is not clear.
In the revised working model, we have explicitly described that the events occur in intestinal stem cells. Regarding Figure 6C, we have delineated the clone with a white dashed line to enhance its clarity and visual comprehension.
(9) The authors state that loss of UGCG in the mouse small intestine results in a reduced ISC count. However, in Supplementary Figure C3, Ki67, a marker of ISC proliferation, is significantly increased in UGCG-CKO mice. This contradiction should be clarified. The authors might repeat this experiment using an alternative ISC marker, such as Lgr5.
Previous studies have indicated that dysregulation of the Notch signaling pathway can result in a reduction in the number of ISCs. While we did not perform a direct quantification of ISC numbers in our experiments, our Olfm4 staining—which serves as a reliable marker for ISCs—demonstrates a clear reduction in the number of positive cells in UGCG-CKO mice.
The increased Ki67 signal we observed reflects enhanced proliferation in the transit-amplifying region, and it does not directly indicate an increase in ISC number. Therefore, in UGCG-CKO mice, we observe a decrease in the number of ISCs, while there is an increase in transit-amplifying (TA) cells (progenitor cells). This increase in TA cells is probably a secondary consequence of the loss of barrier function associated with the UGCG knockout.
eLife Assessment
This paper reports a valuable finding that gastric fluid DNA content can be used as a potential biomarker for human gastric cancer. The evidence supporting the claims of the authors is solid, although an inclusion of explanations for the methodological limitations, moderate diagnostic performance, and the unexpected survival correlation would have strengthened the study. The work will be of interest to medical biologists working in the field of gastric cancer.
Reviewer #1 (Public review):
The study analyzes the gastric fluid DNA content identified as a potential biomarker for human gastric cancer. However, the study lacks overall logicality, and several key issues require improvement and clarification. In the opinion of this reviewer, some major revisions are needed:
(1) This manuscript lacks a comparison of gastric cancer patients' stages with PN and N+PD patients, especially T0-T2 patients.
(2) The comparison between gastric cancer stages seems only to reveal the difference between T3 patients and early-stage gastric cancer patients, which raises doubts about the authenticity of the previous differences between gastric cancer patients and normal patients, whether it is only due to the higher number of T3 patients.
(3) The prognosis evaluation is too simplistic, only considering staging factors, without taking into account other factors such as tumor pathology and the time from onset to tumor detection.
(4) The comparison between gfDNA and conventional pathological examination methods should be mentioned, reflecting advantages such as accuracy and patient comfort.
(5) There are many questions in the figures and tables. Please match the Title, Figure legends, Footnote, Alphabetic order, etc.
(6) The overall logicality of the manuscript is not rigorous enough, with few discussion factors, and cannot represent the conclusions drawn.
Reviewer #2 (Public review):
Summary:
The authors investigated whether the total DNA concentration in gastric fluid (gfDNA), collected via routine esophagogastroduodenoscopy (EGD), could serve as a diagnostic and prognostic biomarker for gastric cancer. In a large patient cohort (initial n=1,056; analyzed n=941), they found that gfDNA levels were significantly higher in gastric cancer patients compared to non-cancer, gastritis, and precancerous lesion groups. Unexpectedly, higher gfDNA concentrations were also significantly associated with better survival prognosis and positively correlated with immune cell infiltration. The authors proposed that gfDNA may reflect both tumor burden and immune activity, potentially serving as a cost-effective and convenient liquid biopsy tool to assist in gastric cancer diagnosis, staging, and follow-up.
Strengths:
This study is supported by a robust sample size (n=941) with clear patient classification, enabling reliable statistical analysis. It employs a simple, low-threshold method for measuring total gfDNA, making it suitable for large-scale clinical use. Clinical confounders, including age, sex, BMI, gastric fluid pH, and PPI use, were systematically controlled. The findings demonstrate both diagnostic and prognostic value of gfDNA, as its concentration can help distinguish gastric cancer patients and correlates with tumor progression and survival. Additionally, preliminary mechanistic data reveal a significant association between elevated gfDNA levels and increased immune cell infiltration in tumors (p=0.001).
Weaknesses:
The study has several notable weaknesses. The association between high gfDNA levels and better survival contradicts conventional expectations and raises concerns about the biological interpretation of the findings. The diagnostic performance of gfDNA alone was only moderate, and the study did not explore potential improvements through combination with established biomarkers. Methodological limitations include a lack of control for pre-analytical variables, the absence of longitudinal data, and imbalanced group sizes, which may affect the robustness and generalizability of the results. Additionally, key methodological details were insufficiently reported, and the ROC analysis lacked comprehensive performance metrics, limiting the study's clinical applicability.
Author response:
Public Reviews:
Reviewer #1 (Public review):
The study analyzes the gastric fluid DNA content identified as a potential biomarker for human gastric cancer. However, the study lacks overall logicality, and several key issues require improvement and clarification. In the opinion of this reviewer, some major revisions are needed:
(1) This manuscript lacks a comparison of gastric cancer patients' stages with PN and N+PD patients, especially T0-T2 patients.
We are grateful for this astute remark. A comparison of gfDNA concentration among the diagnostic groups indicates a trend of increasing values as the diagnosis progresses toward malignancy. The observed values for the diagnostic groups are as follows:
Author response table 1.
The chart below presents the statistical analyses of the same diagnostic/tumor-stage groups (One-Way ANOVA followed by Tukey’s multiple comparison tests). It shows that gastric fluid gfDNA concentrations gradually increase with malignant progression. We observed that the initial tumor stages (T0 to T2) exhibit intermediate gfDNA levels, which are significantly lower than in advanced disease (p = 0.0036) but not statistically different from those in non-neoplastic disease (p = 0.74).
Author response image 1.
(2) The comparison between gastric cancer stages seems only to reveal the difference between T3 patients and early-stage gastric cancer patients, which raises doubts about the authenticity of the previous differences between gastric cancer patients and normal patients, whether it is only due to the higher number of T3 patients.
We appreciate the attention to detail regarding the numbers analyzed in the manuscript. Importantly, the results are meaningful because the number of subjects in each group is comparable (T0-T2, N = 65; T3, N = 91; T4, N = 63). The mean gastric fluid gfDNA values (ng/µL) increase with disease stage (T0-T2: 15.12; T3-T4: 30.75), and both are higher than the mean gfDNA values observed in non-neoplastic disease (10.81 ng/µL for N+PD and 10.10 ng/µL for PN). These subject numbers in each diagnostic group accurately reflect real-world data from a tertiary cancer center.
(3) The prognosis evaluation is too simplistic, only considering staging factors, without taking into account other factors such as tumor pathology and the time from onset to tumor detection.
Histopathological analyses were performed throughout the study not only for the initial diagnosis of tissue biopsies, but also for the classification of Lauren’s subtypes, tumor staging, and the assessment of the presence and extent of immune cell infiltrates. Regarding the time of disease onset, this variable is inherently unknown, by definition, at the time of a diagnostic EGD. While the prognosis definition is indeed straightforward, we believe that a simple, cost-effective, and practical approach is advantageous for patients across diverse clinical settings and is more likely to be effectively integrated into routine EGD practice.
(4) The comparison between gfDNA and conventional pathological examination methods should be mentioned, reflecting advantages such as accuracy and patient comfort.
We wish to reinforce that EGD, along with conventional histopathology, remains the gold standard for gastric cancer evaluation. EGD under sedation is routinely performed for diagnosis, and the collection of gastric fluids for gfDNA evaluation does not affect patient comfort. Thus, while gfDNA analysis was not intended as a replacement for diagnostic EGD and biopsy, it may add prognostic value to this exam.
(5) There are many questions in the figures and tables. Please match the Title, Figure legends, Footnote, Alphabetic order, etc.
We are grateful for these comments and apologize for the clerical oversight. All figures, tables, titles and figure legends have now been double-checked.
(6) The overall logicality of the manuscript is not rigorous enough, with few discussion factors, and cannot represent the conclusions drawn.
We assume that the unusual wording “overall logicality” refers to the rationale and/or reasoning of this investigational study. Our working hypothesis was that during neoplastic disease progression, tumor cells continuously proliferate and, depending on various factors, attract immune cell infiltrates. Consequently, both tumor cells and immune cells (as well as tumor-derived DNA) are released into the fluids surrounding the tumor at its various locations, including blood, urine, saliva, gastric fluids, and others. Indeed, increases in DNA levels within some of these fluids have been documented and are clinically meaningful. The concurrent observation of elevated gastric fluid gfDNA levels and immune cell infiltration supports the hypothesis that increased gfDNA, which may originate not only from tumor cells but also from immune cells, could be associated with better prognosis, as suggested by this study of a large real-world patient cohort.
In summary, we thank Reviewer #1 for his time and effort in a constructive critique of our work.
Reviewer #2 (Public review):
Summary:
The authors investigated whether the total DNA concentration in gastric fluid (gfDNA), collected via routine esophagogastroduodenoscopy (EGD), could serve as a diagnostic and prognostic biomarker for gastric cancer. In a large patient cohort (initial n=1,056; analyzed n=941), they found that gfDNA levels were significantly higher in gastric cancer patients compared to non-cancer, gastritis, and precancerous lesion groups. Unexpectedly, higher gfDNA concentrations were also significantly associated with better survival prognosis and positively correlated with immune cell infiltration. The authors proposed that gfDNA may reflect both tumor burden and immune activity, potentially serving as a cost-effective and convenient liquid biopsy tool to assist in gastric cancer diagnosis, staging, and follow-up.
Strengths:
This study is supported by a robust sample size (n=941) with clear patient classification, enabling reliable statistical analysis. It employs a simple, low-threshold method for measuring total gfDNA, making it suitable for large-scale clinical use. Clinical confounders, including age, sex, BMI, gastric fluid pH, and PPI use, were systematically controlled. The findings demonstrate both diagnostic and prognostic value of gfDNA, as its concentration can help distinguish gastric cancer patients and correlates with tumor progression and survival. Additionally, preliminary mechanistic data reveal a significant association between elevated gfDNA levels and increased immune cell infiltration in tumors (p=0.001).
Reviewer #2 has conceptually grasped the overall rationale of the study quite well, and we are grateful for their assessment and comprehensive summary of our findings.
Weaknesses:
(1) The study has several notable weaknesses. The association between high gfDNA levels and better survival contradicts conventional expectations and raises concerns about the biological interpretation of the findings.
We agree that this would be the case if the gfDNA were derived solely from tumor cells. However, the findings presented here suggest that a fraction of this DNA is derived from infiltrating immune cells. The precise origin of this increased gfDNA remains to be determined in follow-up studies, which we plan to conduct soon using DNA- and RNA-sequencing methodologies and deconvolution analyses.
(2) The diagnostic performance of gfDNA alone was only moderate, and the study did not explore potential improvements through combination with established biomarkers. Methodological limitations include a lack of control for pre-analytical variables, the absence of longitudinal data, and imbalanced group sizes, which may affect the robustness and generalizability of the results.
Reviewer #2 is correct that this investigational study was not designed to assess the diagnostic potential of gfDNA. Instead, its primary contribution is to provide useful prognostic information. In this regard, we have not yet explored combining gfDNA with other clinically well-established diagnostic biomarkers. We do acknowledge this current limitation as a logical follow-up that must be investigated in the near future.
Moreover, we collected a substantial number of pre-analytical variables within the limitations of a study involving over 1,000 subjects. Longitudinal samples and data were not analyzed here, as our aim was to evaluate prognostic value at diagnosis. Although the groups are imbalanced, this accurately reflects the real-world population of a large endoscopy center within a dedicated cancer facility. Subjects were invited to participate and enter the study before sedation for the diagnostic EGD procedure; thus, samples were collected prospectively from all consenting individuals.
Finally, to maintain a large, unbiased cohort, we did not attempt to balance the groups, allowing analysis of samples and data from all patients with compatible diagnoses (please see Results: Patient groups and diagnoses).
(3) Additionally, key methodological details were insufficiently reported, and the ROC analysis lacked comprehensive performance metrics, limiting the study's clinical applicability.
We are grateful for this useful suggestion. In the current version, each ROC curve (Supplementary Figures 1A and 1B) now includes the top 10 gfDNA thresholds, along with their corresponding sensitivity and specificity values (please see Suppl. Table 1). The thresholds are ordered from best to worst based on the classic Youden’s J statistic, as follows:
Youden Index = specificity + sensitivity − 1 [Youden WJ. Index for rating diagnostic tests. Cancer 3:32-35, 1950. PMID: 15405679].
We have made an effort to provide all the key methodological details requested, but we would be glad to add further information upon specific request.
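For readers who wish to reproduce this kind of threshold ranking, the following is a minimal sketch in Python. It is not the code used in the study: the function name `youden_ranking`, the rule that a sample is called positive when its value is at or above the threshold, and the toy numbers are all assumptions made for illustration.

```python
def youden_ranking(values, labels, top_k=10):
    """Rank candidate thresholds by Youden's J = sensitivity + specificity - 1.

    values: measured concentrations (hypothetical numbers in the usage below)
    labels: 1 for the positive class, 0 for the negative class
    Assumption: a sample is called positive when value >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = []
    for threshold in sorted(set(values)):
        # True positives: positive samples at or above the threshold.
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= threshold)
        # True negatives: negative samples below the threshold.
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        ranked.append((j, threshold, sensitivity, specificity))
    # Best (largest J) first; ties keep ascending threshold order.
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked[:top_k]


# Hypothetical toy data, not values from the study.
toy_values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
toy_labels = [0, 0, 0, 1, 1, 1]
best = youden_ranking(toy_values, toy_labels)[0]
```

Each returned row holds (J, threshold, sensitivity, specificity), so the first `top_k` rows correspond to a best-to-worst list of the kind described above.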
eLife Assessment
This study concerns how macaque visual cortical area MT represents stimuli composed of more than one speed of motion. The study is valuable because little is known about how the visual pathway segments and preserves information about multiple stimuli, and the study involves perceptual reports from both humans and one monkey regarding whether there are one or two speeds in the stimulus. The study presents compelling evidence that (on average) MT neurons shift from faster-speed-takes-all at low speeds to representing the average of the two speeds at higher speeds. Ultimately, this study raises intriguing questions about how exactly the response patterns in visual cortical area MT might preserve information about each speed, since such information could potentially be lost in an average response as described here, depending on assumptions about how MT activity is evaluated by other visual areas.
Reviewer #1 (Public review):
Summary:
Most studies in sensory neuroscience investigate how individual sensory stimuli are represented in the brain (e.g., the motion or color of a single object). This study starts tackling the more difficult question of how the brain represents multiple stimuli simultaneously and how these representations help to segregate objects from cluttered scenes with overlapping objects.
Strengths:
The authors first document the ability of humans to segregate two motion patterns based on differences in speed. Then they show that a monkey's performance is largely similar, thus establishing the monkey as a good model to study the underlying neural representations.
Careful quantification of the neural responses in the middle temporal area during the simultaneous presentation of fast and slow speeds leads to the surprising finding that, at low average speeds, many neurons respond as if the slowest speed is not present, while they show averaged responses at high speeds. This unexpected complexity of the integration of multiple stimuli is key to the model developed in this paper.
One experiment in which attention is drawn away from the receptive field supports the claim that this is not due to the involuntary capture of attention by fast speeds.
A classifier using the neuronal response and trained to distinguish single speed from bi-speed stimuli shows a similar overall performance and dependence on the mean speed as the monkey. This supports the claim that these neurons may indeed underlie the animal's decision process.
The authors expand the well-established divisive normalization model to capture the responses to bi-speed stimuli. The incremental modeling (Eqs. 9 and 10) clarifies which aspects of the tuning curves are captured by the parameters.
Reviewer #3 (Public review):
Summary:
This study concerns how macaque visual cortical area MT represents stimuli composed of more than one speed of motion.
Strengths:
The study is valuable because little is known about how the visual pathway segments and preserves information about multiple stimuli. The study presents compelling evidence that (on average) MT neurons shift from faster-speed-takes-all at low speeds to representing the average of the two speeds at higher speeds. An additional strength of the study is the inclusion of perceptual reports from both humans and one monkey participant performing a task in which they judged whether the stimuli involved one vs two different speeds. Ultimately, this study raises intriguing questions about how exactly the response patterns in visual cortical area MT might preserve information about each speed, since such information is potentially lost in an average response as described here.
Reviewing Editor comment on revised version:
The remaining concern was resolved.
Author response:
The following is the authors’ response to the previous reviews.
Reviewer #3 (Recommendations for the authors):
The authors have done an excellent job of addressing most comments, but my concerns about Figure 5 remain. I appreciate the authors' efforts to address the problem involving Rs being part of the computation on both the x and y axes of Figure 5, but addressing this via simulation addresses statistical significance but overlooks effect size. I think the authors may have misunderstood my original suggestion, so I will attempt to explain it better here. Since "Rs" is an average across all trials, the trials could be subdivided in two halves to compute two separate averages - for example, an average of the even numbered trials and an average of the odd numbered trials. Then you would use the "Rs" from the even numbered trials for one axis and the "Rs" from the odd numbered trials for the other. You would then plot R-Rs_even vs Rf-Rs_odd. This would remove the confound from this figure, and allow the text/interpretation to be largely unchanged (assuming the results continue to look as they do).
We have added a description and the result of the new analysis (line #321 to #332), and a supplementary figure (Suppl. Fig. 1) (line #1464 to #1477).
“We calculated 𝑅<sub>𝑠</sub> in the ordinate and abscissa of Figure 5A-E using responses averaged across different subsets of trials, such that 𝑅<sub>𝑠</sub> was no longer a common term in the ordinate and abscissa. For each neuron, we determined 𝑅<sub>𝑠1</sub> by averaging the firing rates of 𝑅<sub>𝑠</sub> across half of the recorded trials, selected randomly. We also determined 𝑅<sub>𝑠2</sub> by averaging the firing rates of 𝑅<sub>𝑠</sub> across the rest of the trials. We regressed (𝑅 − 𝑅<sub>𝑠1</sub> ) on (𝑅<sub>𝑓</sub> − 𝑅<sub>𝑠2</sub>) , as well as (𝑅<sub>𝑠</sub> - 𝑅<sub>𝑠2</sub>) on (𝑅<sub>𝑓</sub> − 𝑅<sub>𝑠1</sub>), and repeated the procedure 50 times. The averaged slopes obtained with 𝑅<sub>𝑠</sub> from the split trials showed the same pattern as those using 𝑅<sub>𝑠</sub> from all trials (Table 1 and Supplementary Fig. 1), although the coefficient of determination was slightly reduced (Table 1). For ×4 speed separation, the slopes were nearly identical to those shown in Figure 5F1. For ×2 speed separation, the slopes were slightly smaller than those in Figure 5F2, but followed the same pattern (Supplementary Fig. 1). Together, these analysis results confirmed the faster-speed bias at the slow stimulus speeds, and the change of the response weights as stimulus speeds increased.”
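The split-trial procedure quoted above can be sketched as follows. This is an illustrative Python outline, not the authors' analysis code: the function names, the data layout (one list of single-trial slower-component responses per neuron), and the toy numbers are assumptions, and only one of the two regressions described, (R − Rs1) on (Rf − Rs2), is shown.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x, with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def split_half_slope(r_bi, r_fast, slow_trials, n_repeats=50, seed=0):
    """Average slope of (R - Rs1) on (Rf - Rs2) over random trial splits.

    r_bi, r_fast: per-neuron mean responses to the bi-speed stimulus and to
                  the faster component alone (hypothetical layout)
    slow_trials:  per-neuron list of single-trial responses to the slower
                  component; each list needs at least two trials
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_repeats):
        xs, ys = [], []
        for r, rf, trials in zip(r_bi, r_fast, slow_trials):
            shuffled = list(trials)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            rs1 = sum(shuffled[:half]) / half                    # first half
            rs2 = sum(shuffled[half:]) / (len(shuffled) - half)  # remaining trials
            xs.append(rf - rs2)  # abscissa uses Rs2
            ys.append(r - rs1)   # ordinate uses Rs1, so Rs is not shared
        slopes.append(ols_slope(xs, ys))
    return sum(slopes) / len(slopes)


# Hypothetical noise-free toy data: R = Rs + 0.7 * (Rf - Rs) for every neuron.
toy_fast = [10.0, 20.0, 30.0, 40.0]
toy_slow = [2.0, 4.0, 9.0, 5.0]
toy_slow_trials = [[rs] * 4 for rs in toy_slow]
toy_bi = [rs + 0.7 * (rf - rs) for rf, rs in zip(toy_fast, toy_slow)]
toy_slope = split_half_slope(toy_bi, toy_fast, toy_slow_trials)
```

With noisy single-trial responses the two half-averages differ, which is what removes the shared term from the two axes; the noise-free toy data simply confirm that the slope recovers the weight on the faster component.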
An additional remaining item concerns the terminology weighted sum, in the context of the constraint that wf and ws must sum to one. My opinion is that it is non-standard to use weighted sum when the computation is a weighted average, but as long as the authors make their meaning clear, the reader will be able to follow. I suggest adding some phrasing to explain to the reader the shift in interpretation from the more general weighted sum to the more constrained weighted average. Specifically, "weighted sum" first appears on line 268, and then the additional constraint of ws + wf =1 is introduced on line 278. Somewhere around line 278, it would be useful to include a sentence stating that this constraint means the weighted sum is constrained to be a weighted average.
Thanks for the suggestion. We have modified the text as follows. Since we made other modifications in the text, the line numbers are slightly different from the last version.
Line #274 to 275:
“Since it is not possible to solve for both variables, 𝑤<sub>𝑠</sub> and 𝑤<sub>𝑓</sub>, from a single equation (Eq. 5) with three data points, we introduced an additional constraint: 𝑤<sub>𝑠</sub> + 𝑤<sub>𝑓</sub> =1. With this constraint, the weighted sum becomes a weighted average.”
Also on line #309:
“First, at each speed pair and for each of the 100 neurons in the data sample shown in Figure 5, we simulated the response to the bi-speed stimuli (𝑅<sub>𝑒</sub>) as a randomly weighted average of 𝑅<sub>𝑓</sub> and 𝑅<sub>𝑠</sub> of the same neuron:
𝑅<sub>𝑒</sub> = 𝑎𝑅<sub>𝑓</sub> + (1 − 𝑎)𝑅<sub>𝑠</sub>,
in which 𝑎 was a randomly generated weight (between 0 and 1) for 𝑅<sub>𝑓</sub>, and the weights for 𝑅<sub>𝑓</sub> and 𝑅<sub>𝑠</sub> summed to one.”
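The two quoted passages can be illustrated with a short Python sketch. It assumes that the bi-speed response is modeled as R = ws·Rs + wf·Rf (the form of Eq. 5 is not reproduced in this response, so this is an assumption) and that the simulated response is a·Rf + (1 − a)·Rs with a drawn uniformly between 0 and 1. The function names and toy numbers are hypothetical.

```python
import random


def weights_from_average(r_bi, r_slow, r_fast):
    """Solve R = ws*Rs + wf*Rf for (ws, wf) under the constraint ws + wf = 1.

    Substituting ws = 1 - wf gives wf = (R - Rs) / (Rf - Rs).
    """
    if r_fast == r_slow:
        raise ValueError("weights are undefined when Rf equals Rs")
    w_fast = (r_bi - r_slow) / (r_fast - r_slow)
    return 1.0 - w_fast, w_fast


def simulate_bi_speed(r_fast, r_slow, seed=0):
    """Simulate each neuron's bi-speed response as a randomly weighted average."""
    rng = random.Random(seed)
    simulated = []
    for rf, rs in zip(r_fast, r_slow):
        a = rng.random()  # random weight for the faster component, in [0, 1)
        simulated.append(a * rf + (1.0 - a) * rs)
    return simulated


# Hypothetical toy responses (spikes/s), not values from the study.
w_slow, w_fast = weights_from_average(r_bi=25.0, r_slow=10.0, r_fast=30.0)
toy_sim = simulate_bi_speed([30.0, 40.0], [10.0, 5.0])
```

Because the weights sum to one, every simulated response lies between the neuron's responses to the two components, which is the sense in which the weighted sum becomes a weighted average.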
eLife Assessment
This paper presents the fundamental discovery that lipid metabolic imbalance induced by Snail, an EMT-related transcription factor, contributes to the acquisition of chemoresistance in cancer cells. The evidence, supported by a wide range of methods and adequate quantification, provides a convincing mechanistic explanation of how Snail drives ectopic expression of the cholesterol- and drug-efflux transporter ABCA1. This work, which introduces a novel therapeutic concept targeting invasive cancer, will be of broad interest to researchers in cancer biology, lipid metabolism, and cell biology.
Reviewer #1 (Public review):
The authors focus on the molecular mechanisms by which EMT cells confer resistance to cancer cells. The authors use a wide range of methods to reveal that overexpression of Snail in EMT cells induces cholesterol/sphingomyelin imbalance via transcriptional repression of biosynthetic enzymes involved in sphingomyelin synthesis. The study also revealed that ABCA1 is important for cholesterol efflux and thus for counterbalancing the excess of intracellular free cholesterol in these snail-EMT cells. Inhibition of ACAT, an enzyme catalyzing cholesterol esterification, also seems essential to inhibit the growth of snail-expressing cancer cells.
Overall, the provided data are convincing and enhance our knowledge on cancer biology.
Reviewer #2 (Public review):
Summary:
This revised study provides a clearer and more mechanistically grounded explanation of how lipid metabolic imbalance contributes to EMT-associated chemoresistance in renal cancer. In this study, the authors discovered that chemoresistance in RCC cell lines correlates with the expression levels of ABCA1 and the EMT-related transcription factor Snail. They demonstrate that Snail induces ABCA1 expression and chemoresistance, and that inhibition of ABCA1-associated pathways can counteract this resistance. The study also suggests that Snail disrupts the cholesterol-sphingomyelin balance by repressing enzymes involved in VLCFA-sphingomyelin synthesis, leading to excess free cholesterol and activation of the LXR-ABCA1 axis. Importantly, inhibiting cholesterol esterification, which renders free cholesterol inert, selectively suppresses growth of a xenograft model of Snail-positive kidney cancer. These findings provide potential lipid metabolism-targeting strategies for cancer therapy. The revised version includes additional quantitative analyses and new experiments addressing lipid balance and ABCA1 localization, further strengthening the overall mechanistic model.
Strengths:
This revised manuscript provides a more comprehensive and convincing mechanistic explanation for how Snail-driven EMT induces chemoresistance through altered lipid homeostasis. The study presents a novel concept in which the Chol/SM balance, rather than individual lipid levels, shapes therapeutic vulnerability. The potential for targeting cholesterol detoxification pathways in Snail-positive cancer cells remains a significant therapeutic implication. In the revised version, the authors provide additional quantitative analyses and complementary experiments - including ABCA1 localization, restoration of VLCFA-SM levels by supplementation with C22:0 ceramide, and membrane-order assays - which further strengthen the mechanistic interpretation and address key concerns raised in earlier reviews.
Weaknesses:
The revised version includes new experiments showing that restoring sphingomyelin levels suppresses ABCA1 expression, thereby strengthening the causal link between altered lipid balance and ABCA1 induction. However, the evidence that ABCA1 is directly required for chemoresistance remains somewhat limited, as the phenotype was not reproduced by ABCA1 knockout or knockdown, and CsA may affect additional targets beyond ABCA1.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
The authors focus on the molecular mechanisms by which EMT cells confer resistance to cancer cells. The authors use a wide range of methods to reveal that overexpression of Snail in EMT cells induces cholesterol/sphingomyelin imbalance via transcriptional repression of biosynthetic enzymes involved in sphingomyelin synthesis. The study also revealed that ABCA1 is important for cholesterol efflux and thus for counterbalancing the excess of intracellular free cholesterol in these snail-EMT cells. Inhibition of ACAT, an enzyme catalyzing cholesterol esterification, also seems essential to inhibit the growth of snail-expressing cancer cells.
However, it seems important to analyze the localization of ABCA1, as it is possible that in the event of cholesterol/sphingomyelin imbalance, for example, the intracellular trafficking of the pump may be altered.
The authors should also analyze ACAT levels and/or activity in snail-EMT cells that should be increased. Overall, the provided data are important to better understand cancer biology.
We thank the reviewer for recognizing the significance of our study. Consistent with the hypothesis that ABCA1 contributes to chemoresistance in hybrid E/M cells, we agree that demonstrating the localization of ABCA1 at the plasma membrane is important, and we have included additional experiments to address this point.
We also examined the expression of the major ACAT isoform in the kidney, SOAT1, across RCC cell lines. However, its expression did not correlate with that of Snail (Figure 4B), suggesting that SOAT1 is constitutively expressed at a certain level regardless of Snail expression. The details of these additional experiments are provided in the point-by-point responses below.
Reviewer #2 (Public review):
Summary:
In this study, the authors discovered that the chemoresistance in RCC cell lines correlates with the expression levels of the drug transporter ABCA1 and the EMT-related transcription factor Snail. They demonstrate that Snail induces ABCA1 expression and chemoresistance, and that ABCA1 inhibitors can counteract this resistance. The study also suggests that Snail disrupts the cholesterol-sphingomyelin (Chol/SM) balance by repressing the expression of enzymes involved in very long-chain fatty acid-sphingomyelin synthesis, leading to excess free cholesterol. This imbalance activates the cholesterol-LXR pathway, inducing ABCA1 expression. Moreover, inhibiting cholesterol esterification suppresses Snail-positive cancer cell growth, providing potential lipid-targeting strategies for invasive cancer therapy.
Strengths:
This research presents a novel mechanism by which the EMT-related transcription factor Snail confers drug resistance by altering the Chol/SM balance, introducing a previously unrecognized role of lipid metabolism in the chemoresistance of cancer cells. The focus on lipid balance, rather than individual lipid levels, is a particularly insightful approach. The potential for targeting cholesterol detoxification pathways in Snail-positive cancer cells is also a significant therapeutic implication.
Weaknesses:
The study's claim that Snail-induced ABCA1 is crucial for chemoresistance relies only on pharmacological inhibition of ABCA1, lacking additional validation. The causal relationship between the disrupted Chol/SM balance and ABCA1 expression or chemoresistance is not directly supported by data. Some data lack quantitative analysis.
We thank the reviewer for his/her insightful and constructive comments. In response, we have performed additional experiments using complementary approaches to further substantiate the contribution of Snail-induced ABCA1 expression to chemoresistance. Furthermore, to clarify the causal relationship between reduced sphingomyelin biosynthesis and ABCA1 expression, we conducted new experiments showing that supplementation with sphingolipids attenuates ABCA1 upregulation (Figure 3H). The details of these additional experiments are described in the point-by-point responses below.
Reviewer #1 (Recommendations for the authors):
In this paper, the authors reveal that snail expression in EMT-cells leads to an imbalance between cholesterol and sphingomyelin via a transcriptional repression of enzymes involved in the biosynthesis of sphingomyelin.
This paper is interesting and highlights how the imbalance of lipids would impact chemotherapy resistance. However, I have a few comments.
In Figure 2, filipin staining appears exclusively at the plasma membrane in EpH4 cells, whereas in EpH4-Snail cells filipin staining is also intracellular. It seems plausible that not all filipin-positive intracellular staining is in LDs; the authors should therefore try to colocalize filipin with other intracellular markers. To this end, the authors might want to use a TopFluor-cholesterol probe, for instance.
We examined the distribution of TopFluor-cholesterol in hybrid E/M cells (Figure 2H) and found that TopFluor-cholesterol colocalizes with lipid droplets. In addition, we analyzed the colocalization between intracellular filipin signals and organelle-specific proteins, ADRP (lipid droplets) and LAMP1 (lysosomes) (Figure 2I). Since filipin binds exclusively to unesterified cholesterol, filipin signals did not colocalize with ADRP. Instead, we observed colocalization of filipin with LAMP1, suggesting that cholesterol accumulates in hybrid E/M cells in both esterified and unesterified forms.
In Figure 3, the authors reveal that exogenous expression of Snail alters the ratio of cholesterol to sphingomyelin. The authors should show where intracellular cholesterol and intracellular sphingomyelin are located within these EpH4-Snail cells.
To investigate the lipid composition of the plasma membrane, we utilized lipid-binding protein probes, D4 (for cholesterol) and lysenin (for sphingomyelin) (Figures 2L and 2M). We found that the plasma membrane cholesterol content was not affected by EMT, whereas sphingomyelin levels were markedly decreased. In addition, intracellular cholesterol was visualized (Comment 1-1; Figures 2E–2K). On the other hand, because visualization of intracellular sphingomyelin is technically challenging, we were unable to include this analysis in the present study. We consider this an important direction for future investigation.
Regarding the model described in panel K of Figure 3, I would expect that the changes in lipid-membrane organization depicted in panel K should affect, for instance, the pattern of GM1 toxin or the motility of raft-associated proteins. The authors could perform these experiments to support the proposed change in plasma membrane lipid organization.
We attempted staining with FITC–cholera toxin to visualize GM1, but both EpH4 and EpH4–Snail cells exhibited very low levels of GM1, resulting in minimal or no detectable staining (data not shown). Instead, to assess the impact of decreased sphingomyelin on the overall biophysical properties of the plasma membrane, we used a plasma membrane–specific lipid-order probe, FπCM–SO₃ (Figures 2N–2P and Figure 2—figure supplement 3). We found that the plasma membrane of EpH4–Snail cells was more disordered (fluidized), suggesting that the overall properties of the plasma membrane are altered by ectopic expression of Snail.
Another issue is the intracellular localization of ABCA1 in EpH4-Snail cells. Knowing that a change in the cholesterol/sphingomyelin ratio can also modify intracellular protein trafficking, it seems important to analyze the intracellular localization of ABCA1 in EpH4-Snail cells.
We performed immunofluorescence microscopy for ABCA1 and found that ABCA1 was mainly localized at the plasma membrane in EpH4–Snail cells (Figure 1M).
As for the data on ACAT inhibition, we expect an increase in ACAT activity and protein levels in EMT cells overexpressing Snail. The authors should also investigate this point.
As noted in our response to the public review, we examined the expression of the major ACAT isoform in the kidney, SOAT1, across RCC cell lines. However, its expression did not correlate with Snail (Figure 4B), suggesting that SOAT1 is expressed at sufficient levels even in cells with low Snail expression. We agree that measuring ACAT activity would be important, as ACATs are regulated at multiple levels. However, we consider this to be beyond the scope of the present study and plan to address it in future work.
Minor comments
I do not understand why in the text, Figure S1 appears after Figure S2. The authors might want to change the numbering of these two figures.
We thank the reviewer for pointing this out. We have corrected the numbering of the supplementary figures so that Figure S1 now appears before Figure S2 in both the text and the revised figure legends.
Page 5, line 20: Figure 1I instead of 1H.
Page 6, line 2: Figure 1J instead of 1I; line 9: Figure 1H instead of 1I.
We thank the reviewer for carefully checking the figure references. We have corrected the figure numbering errors in the text as suggested.
Reviewer #2 (Recommendations for the authors):
For Figures 1B, 1H, 1J, 2B, 2C, 3G, S3A, and S3B, to enhance data reliability, it is necessary to conduct a quantitative analysis of the Western blot data. The average values from at least three biological replicates should be calculated, with statistical significance assessed.
We have conducted quantitative analyses of the Western blot data for Figures 1B, 1H, 1J, 2B, 2C, 3G, S3A, and S3B. Band intensities from at least three independent biological replicates were quantified, and the mean values with statistical significance are now presented in the revised figures.
For Figures 1D, 2A, 2D, and S2, the images of cells or tissues should not rely solely on selected fields. Quantitative analysis is required, and the mean values from at least three biological replicates should be provided with statistical significance testing.
We have performed quantitative analyses for Figures 1D, 2A, 2D, and S2. The quantification was based on data from at least three independent biological replicates, and the mean values with statistical significance are now included in the revised figures.
For Figures 1A, 1G, 4, and S5, evaluating ABCA1's involvement in drug resistance based solely on CsA treatment is insufficient. Demonstrating the loss of drug resistance through ABCA1 knockdown or knockout is necessary.
We generated ABCA1 knockout EpH4–Snail cells and examined their resistance to nitidine chloride. However, knockout of ABCA1 alone did not affect resistance to the compound (Figure 2 - figure supplement 2). This may be due to secondary metabolic alterations induced by ABCA1 loss or compensatory upregulation of other LXR-induced cholesterol efflux transporters. Instead, we demonstrated that treatment with the LXR inhibitor GSK2033 reduced the nitidine chloride resistance of EpH4–Snail cells (Figure 2C), supporting the idea that enhanced efflux of antitumor agents through the LXR–ABCA1–mediated cholesterol efflux pathway contributes to nitidine chloride resistance.
For Figure 3, to establish a causal relationship between changes in the Chol/SM balance and ABCA1 expression, it is important to test whether modifying cholesterol and SM levels to disrupt this balance affects ABCA1 expression.
Regarding causality, as shown in Figure 2, we have already demonstrated that reducing cholesterol levels in EpH4–Snail cells decreases ABCA1 expression. To further explore this relationship, we examined whether increasing sphingomyelin levels by adding ceramide to the culture medium—thereby restoring the sphingomyelin-to-cholesterol ratio—would reduce ABCA1 expression (Figure 3H). Indeed, supplementation with C22:0 ceramide decreased ABCA1 expression, suggesting that downregulation of the VLCFA-sphingomyelin biosynthetic pathway triggers ABCA1 upregulation. Collectively, these findings support a causal relationship between the Chol/SM balance and ABCA1 expression.
In Figure 3, if there is any information on differences in cholesterol affinity between LCFA-SM and VLCFA-SM, it would be beneficial to include it in the manuscript.
Differences in cholesterol affinity between LCFA-SM and VLCFA-SM in cellular membranes remain controversial and have yet to be fully elucidated. The decrease in cell surface sphingomyelin content, evaluated by lysenin staining (Figure 2L), was more pronounced than that of total sphingomyelin (Figure 3A). Given that VLCFA-SMs have been suggested to undergo distinct trafficking during recycling from endosomes to the plasma membrane (Koivusalo et al. Mol Biol Cell 2007), their reduction may lead to decreased plasma membrane sphingomyelin content by altering its intracellular distribution. We have added this discussion to the revised manuscript.
In Figure 3F, it is recommended to assess housekeeping gene expression as a control. Quantitative real-time PCR should be performed, and the average values from at least three biological replicates should be presented.
We have performed quantitative RT-PCR analysis. The average values from at least three independent biological replicates are presented in Figure 3G.
For Figure 3F, to show whether the reduction of CERS3 or ELOVL7 affects the Chol/SM balance and ABCA1 expression, it is necessary to investigate the phenotypes following the knockdown or knockout of these enzymes.
We fully agree that phenotypic analyses of epithelial cells lacking CerS3 or ELOVL7 would provide valuable insights. However, we consider such investigations to be beyond the scope of the present study and plan to pursue them in future work.
Clarifying whether similar phenotypes are induced by other EMT-related transcription factors, or if they are specific to Snail, would be beneficial.
We agree that examining whether similar phenotypes are induced by other EMT-related transcription factors would be highly valuable for understanding the broader EMT network. However, as the focus of the present study is on lipid metabolic alterations associated with EMT—particularly the imbalance between sphingomyelin and cholesterol—we consider this investigation to be beyond the scope of the current work and plan to address it in future studies.
There are errors in figure citations within the text that need correction:
p.9 l.18 Fig. 3D → Fig. 3G
p.9 l.22 Fig. 3I → Fig. 3H
p.9 l.23 Fig. S2 → Fig. S4
p.10 l.6 Fig. 3J → Fig. 1J
p.10 l.8 Fig. 3J → Fig. 1J
p.10 l.9 Fig. 3K → Fig. 3I
p.10 l.12 Fig. 3H → Fig. 3J
p.10 l.14 Fig. 2D and Fig. S4 → Fig. 2G and Fig. S4D
We thank the reviewer for carefully pointing out these citation errors. We have corrected all figure references in the text as suggested.
eLife Assessment
This study reports the important development and characterization of next-generation analogs of the molecule AA263, which was previously identified for its ability to promote adaptive ER proteostasis remodeling. The evidence supporting the conclusions is convincing, with rigorous assays used to benchmark the changes in potency and efficacy of the AA263 analogs as well as AA263 targets. The ability of AA263 analogs to restore the loss of function associated with disease-associated proteins prone to misfolding will be of interest to pharmacologists, chemical biologists, and cell biologists, as well as those working on protein misfolding disorders.
Reviewer #1 (Public review):
Summary:
This study builds off prior work that focused on the molecule AA147 and its role as an activator of the ATF6 arm of the unfolded protein response. In prior manuscripts, AA147 was shown to enter the ER, covalently modify a subset of protein disulfide isomerases (PDIs), and improve ER quality control for the disease-associated mutants of AAT and GABAA. Unsuccessful attempts to improve the potency of AA147 have led the authors to characterize a second hit from the screen in this study: the phenylhydrazone compound AA263. The focus of this study on enhancing biological activity of the AA147 molecule is compelling, and overcomes a hurdle of the prior AA147 drug that proved difficult to modify. The study successfully identifies PDIs as a shared cellular target of AA263 and its analogs. The authors infer, based on the similar target hits previously characterized for AA147, that PDI modification likely accounts for a mechanism of action for AA263.
Strengths:
The work establishes the ability to modify the AA263 molecule to create analogs with more potency and efficacy for ATF6 activation. The "next generation" analogs are able to enhance the levels of functional AAT and GABAA receptors in cellular models expressing the Z-variant of AAT or an epilepsy-associated variant of the GABAA receptor, outlining the therapeutic potential for this molecule and laying the foundation for future organism-based studies.
The authors are able to establish that like AA147, AA263 covalently targets ER PDIs. While it is a likely mechanism that AA263 works through the PDIs, the authors are careful to discuss that this is a potential mechanism that remains to be explicitly proven. The study provides the foundation for future work to further define a role for the PDIs in the actions of AA263.
Reviewer #2 (Public review):
Modulating the UPR by pharmacological targeting of its sensors (or regulators) provides mostly uncharted opportunities in diseases associated with protein misfolding in the secretory pathway. Spearheaded by the Kelly and Wiseman labs, ATF6 modulators were developed in previous years that act on ER PDIs as regulators of ATF6. However, hurdles in their medicinal chemistry have hampered further developments. In this study, the authors provide evidence that the small molecule AA263 also targets and covalently modifies ER PDIs with the effect of activating ATF6. Importantly, AA263 turned out to be amenable to chemical optimization while maintaining its desired activity. Building on this, the authors show that AA263 derivatives can improve aggregation, trafficking and function of two disease-associated mutants of secretory pathway proteins. Together, this study provides compelling evidence for AA263 (and its derivatives) being interesting modulators of ER proteostasis. Mechanistic details of its mode of action will need more attention in future studies that can now build on this.
In detail, the authors provide strong evidence that AA263 covalently binds to ER PDIs, which will inhibit the protein disulfide isomerase activity. ER PDIs regulate ATF6, and thus their finding provides a mechanistic interpretation of AA263 activating the UPR. It should be noted, however, that AA263 shows broad protein labeling (Fig. 1G) which may suggest additional targets, beyond the ones defined as MS hits in this study. Also, a further direct analysis of the IRE1 and PERK pathways (activated or not by AA263) may be an interesting future direction, as e.g. PDIA1, a target of AA263, directly regulates IRE1 (Yu et al., EMBO J, 2020) and other PDIs also act on PERK and IRE1. The authors interpret modest activation of IRE1/PERK target genes (Fig. 2C) as an effect on target gene overlap, indeed the most likely explanation based on their selective analyses on IRE1 (ERdj4) and PERK (CHOP) downstream genes, but direct activation due to the targeting of their PDI regulators is also a possible explanation. Further key findings of this paper are the observed improvement of AAT behavior and GABAA trafficking and function. Further strength to the mechanistic conclusion that ATF6 activation causes this could be obtained by using ATF6 inhibitors/knockouts in the presence of AA263 (as the target PDIs may directly modulate behavior of AAT and/or GABAA). Along the same line, it also warrants further investigation in future studies why the different compounds, even if all were used at concentrations above their EC50, had different rescuing capacities on the clients.
Together, the study now provides a strong basis for such in-depth mechanistic analyses.
Reviewer #3 (Public review):
Summary:
This study aims to develop and characterize phenylhydrazone-based small molecules that selectively activate the ATF6 arm of the unfolded protein response by covalently modifying a subset of ER-resident PDIs. The authors identify AA263 as a lead scaffold and optimize its structure to generate analogs with improved potency and ATF6 selectivity, notably AA263-20. These compounds are shown to restore proteostasis and functional expression of disease-associated misfolded proteins in cellular models involving both secretory (AAT-Z) and membrane (GABAA receptor) proteins. The findings provide valuable chemical tools for modulating ER proteostasis and may serve as promising leads for therapeutic development targeting protein misfolding diseases.
Strengths:
The study presents a well-defined chemical biology framework integrating proteomics, transcriptomics, and disease-relevant functional assays.
Identification and optimization of a new electrophilic scaffold (AA263) that selectively activates ATF6 represents a valuable advance in UPR-targeted pharmacology.
SAR studies are comprehensive and logically drive the development of more potent and selective analogs such as AA263-20.
Functional rescue is demonstrated in two mechanistically distinct disease models of protein misfolding, one involving a secretory protein and the other a membrane protein, underscoring the translational relevance of the approach.
Weaknesses:
ATF6 activation is primarily inferred from reporter assays and transcriptional profiling; direct biochemical evidence of ATF6 cleavage or nuclear translocation remains missing. However, the authors have added supporting data showing that co-treatment with the ATF6 inhibitor CP7 suppresses target gene induction, which partially strengthens the evidence for ATF6-dependent activity.
Although the proposed mechanism involving PDI modification and ATF6 activation is plausible, it is still not experimentally demonstrated and remains incompletely characterized.
In vivo validation is absent, and thus the pharmacological feasibility, selectivity, and bioavailability of these compounds in physiological systems remain untested.
Comments on revisions:
The authors have generally addressed my comments.
Author response:
The following is the authors’ response to the previous reviews.
Reviewer #1 (Public review):
Summary:
This study builds off prior work that focused on the molecule AA147 and its role as an activator of the ATF6 arm of the unfolded protein response. In prior manuscripts, AA147 was shown to enter the ER, covalently modify a subset of protein disulfide isomerases (PDIs), and improve ER quality control for the disease-associated mutants of AAT and GABAA. Unsuccessful attempts to improve the potency of AA147 have led the authors to characterize a second hit from the screen in this study: the phenylhydrazone compound AA263. The focus of this study on enhancing the biological activity of the AA147 molecule is compelling, and overcomes a hurdle of the prior AA147 drug that proved difficult to modify. The study successfully identifies PDIs as a shared cellular target of AA263 and its analogs. The authors infer, based on the similar target hits previously characterized for AA147, that PDI modification accounts for a mechanism of action for AA263.
Strengths:
The authors are able to establish that, like AA147, AA263 covalently targets ER PDIs. The work establishes the ability to modify the AA263 molecule to create analogs with more potency and efficacy for ATF6 activation. The "next generation" analogs are able to enhance the levels of functional AAT and GABAA receptors in cellular models expressing the Z-variant of AAT or an epilepsy-associated variant of the GABAA receptor, outlining the therapeutic potential for this molecule and laying the foundation for future organism-based studies.
We thank the reviewer for the positive comments on our manuscript. We address the reviewer's remaining comments on our work, as described below.
Weaknesses:
Arguably, the work does not fully support the statement provided in the abstract that the study "reveals a molecular mechanism for the activation of ATF6". The identification of targets of AA263 and its analogs is clear. However, it is a presumption that the overlap in PDIs as targets of both AA263 and AA147 means that AA263 works through the PDIs. While a likely mechanism, this conclusion would be bolstered by establishing that knockdown of the PDIs lessens drug impact with respect to ATF6 activation.
We thank the reviewer for this comment. We previously showed that genetic depletion of different PDIs modestly impacts ATF6 activation afforded by ATF6-activating compounds such as AA147 (see Paxman et al (2018) eLife). However, as discussed in this manuscript, the ability of AA147 and AA263 to activate ATF6 signaling is mediated through polypharmacologic targeting of multiple different PDIs involved in regulating the redox state of ATF6. Thus, individual knockdowns are predicted to only minimally impact the ability of AA263 and its analogs to activate ATF6 signaling.
To address this comment, we have tempered our language regarding the mechanism of AA263-dependent ATF6 activation through PDI targeting described herein to better reflect the fact that we have not explicitly proven that PDI targeting is responsible for this activity, as highlighted below:
Page 7, Line 158: “Intriguingly, 12 proteins were shared between these two conditions, including 7 different ER-localized PDIs (Fig. 1H). This includes PDIs previously shown to regulate ATF6 activation including TXNDC12/ERP18.[45,46] These results are similar to those observed when comparing proteins modified by the selective ATF6 activating compound AA147<sup>yne</sup> and AA132<sup>yne</sup>.[38] Further, we found that the extent of labeling for PDIs including PDIA1, PDIA4, PDIA6, and TMX1, but not TXNDC12, showed greater modification by AA132<sup>yne</sup>, as compared to AA263<sup>yne</sup> (Fig. 1I). Similar results were observed for AA147<sup>yne</sup>.[38] This suggests that, like AA147, the selective activation of ATF6 afforded by AA263 is likely attributed to the modifications of a subset of multiple different ER-localized PDIs by this compound.”
Alternatively, it has previously been suggested that the cell-type dependent activity of AA263 may be traced to the presence of cell-type specific P450s that allow for the metabolic activation of AA263 or cell-type specific PDIs (Plate et al 2016; Paxman et al 2018). If the PDI target profile is distinct in different cell types, and these target differences correlate with ATF6-induced activity by AA263, that would also bolster the authors' conclusion.
As highlighted by the reviewer, different ER oxidases (e.g., P450s) could differentially influence activation of compounds such as AA263 to promote PDI modification and subsequent ATF6 activation. The specific ER oxidases responsible for AA263 activation are currently unknown; however, we anticipate that multiple different enzymes can promote this activity, making it difficult to discern the specific contributions of any one oxidase. We have made this point clearer in the revised submission, as below:
Page 7, Line 169: “This specificity for ER proteins instead suggests the localized generation of AA263 quinone methides at the ER membrane, likely through metabolic activation by different ER localized oxidases, which has previously been shown to contribute to the selective modification of ER proteins afforded by other compounds such as AA147 [49].”
Reviewer #2 (Public review):
Modulating the UPR by pharmacological targeting of its sensors (or regulators) provides mostly uncharted opportunities in diseases associated with protein misfolding in the secretory pathway. Spearheaded by the Kelly and Wiseman labs, ATF6 modulators were developed in previous years that act on ER PDIs as regulators of ATF6. However, hurdles in their medicinal chemistry have hampered further development. In this study, the authors provide evidence that the small molecule AA263 also targets and covalently modifies ER PDIs, with the effect of activating ATF6. Importantly, AA263 turned out to be amenable to chemical optimization while maintaining its desired activity. Building on this, the authors show that AA263 derivatives can improve the aggregation, trafficking, and function of two disease-associated mutants of secretory pathway proteins. Together, this study provides compelling evidence for AA263 (and its derivatives) being interesting modulators of ER proteostasis. Mechanistic details of its mode of action will need more attention in future studies that can now build on this.
We thank the reviewer for their positive comments on our manuscript. We address the reviewer’s specific queries on our work, as outlined below.
In detail, the authors provide strong evidence that AA263 covalently binds to ER PDIs, which will inhibit the protein disulfide isomerase activity. ER PDIs regulate ATF6, and thus their finding provides a mechanistic interpretation of AA263 activating the UPR. It should be noted, however, that AA263 shows broad protein labeling (Figure 1G), which may suggest additional targets, beyond the ones defined as MS hits in this study.
This is true. We do show broad proteome-wide labeling with AA263<sup>yne</sup>, which is largely reflected in the hits identified by MS beyond PDI family members. It is possible that other observed engaged targets, in addition to PDIs, may contribute to the activation of ATF6 signaling. Regardless, our MS analysis clearly shows that the proteins modified by AA263 are enriched for PDIs, further supporting our model whereby AA263-dependent PDI modification is likely responsible for ATF6 activation.
Also, a further direct analysis of the IRE1 and PERK pathways (activated or not by AA263) would have been a benefit, as e.g., PDIA1, a target of AA263, directly regulates IRE1 (Yu et al., EMBO J, 2020), and other PDIs also act on PERK and IRE1. The authors interpret modest activation of IRE1/PERK target genes (Figure 2C) as an effect on target gene overlap, indeed the most likely explanation based on their selective analyses on IRE1 (ERdj4) and PERK (CHOP) downstream genes, but direct activation due to the targeting of their PDI regulators is also a possible explanation.
While we do observe mild increases in IRE1/XBP1s target genes, we do not observe significant increases in PERK/ISR target genes in cells treated with optimized AA263 analogs (see Fig. 2C). We previously showed that genetic ATF6 activation leads to a modest increase in IRE1/XBP1s target genes, reflecting the overlap in target genes of the IRE1/XBP1s and ATF6 pathways (see Shoulders et al (2013) Cell Reports). However, with our data, we cannot explicitly rule out the possibility that the mild increase in IRE1/XBP1s target genes reflects direct IRE1/XBP1s activation, as suggested by the reviewer. To address this, we have adapted the text to highlight this point, now specifically referring to preferential ATF6 activation afforded by these compounds, as below:
Page 5, Line 100: “In addition to finding AA147, our original high-throughput screen also identified the phenylhydrazone compound AA263 as a compound that preferentially activates the ATF6 arm of the UPR [26]”
Further key findings of this paper are the observed improvement of AAT behavior and GABAA trafficking and function. Further strength to the mechanistic conclusion that ATF6 activation causes this could be obtained by using ATF6 inhibitors/knockouts in the presence of AA263 (as the target PDIs may directly modulate the behavior of AAT and/or GABAA).
AA263 and related compounds could influence ER proteostasis of destabilized proteins through multiple mechanisms including ATF6 activation or direct modification of a subset of PDIs. We previously showed that AA263-dependent enhancement of A1AT-Z secretion and activity can be largely attributed to ATF6 activation (see Sun et al (2023) Cell Chem Biol). In the revised submission, we now show that increased levels of γ2(R177G) afforded by treatment with AA263<sup>yne</sup> are partially blocked by co-treatment with the ATF6 inhibitor Ceapin-A7 (CP7), highlighting the contributions of ATF6 activation to this phenotype (Fig. S5B,C). Intriguingly, this result also demonstrates the benefit of targeting ER proteostasis using compounds such as our optimized AA263 analogs, as this approach allows us to enhance ER proteostasis of destabilized proteins through multiple mechanisms. We further expand on this specific point in the revised manuscript as below:
Page 14, Line 375: “AA263 and its related analogs can influence ER proteostasis in these models through different mechanisms including ATF6-dependent remodeling of ER proteostasis and direct alterations to the activity of specific PDIs.(*) Consistent with this, we show that pharmacologic inhibition of ATF6 only partially blocks increases of γ2(R177G) afforded by treatment with AA263<sup>yne</sup>, highlighting the benefit for targeting multiple aspects of ER proteostasis to enhance ER proteostasis of this disease-relevant GABA<sub>A</sub> variant. While additional studies are required to further deconvolute the relative contributions of these two mechanisms on the protection afforded by our optimized compounds, our results demonstrate the potential for these compounds to enhance ER proteostasis in the context of different protein misfolding diseases.”
Along the same line, it also warrants further investigation why the different compounds, even if all were used at concentrations above their EC50, had different rescuing capacities on the clients.
This is an interesting question that we are continuing to study. In general, we observe a fairly good correlation between ATF6 activation and correction of diseases of ER proteostasis linked to proteins such as A1AT-Z or GABA<sub>A</sub> receptors. However, as the reviewer points out, we do find that some compounds are more efficient at correcting proteostasis than others that activate ATF6 to similar levels. We attribute this to differences in either the labeling efficiency of PDIs or the differential regulation of various ER proteostasis factors, although that remains to be further defined. As we continue working with these (and other) compounds, we will focus on defining the molecular basis for these findings.
Together, the study now provides a strong basis for such in-depth mechanistic analyses.
We agree and we are continuing to pursue the mechanistic basis of ER proteostasis remodeling afforded by these and related compounds.
Reviewer #3 (Public review):
Summary:
This study aims to develop and characterize phenylhydrazone-based small molecules that selectively activate the ATF6 arm of the unfolded protein response by covalently modifying a subset of ER-resident PDIs. The authors identify AA263 as a lead scaffold and optimize its structure to generate analogs with improved potency and ATF6 selectivity, notably AA263-20. These compounds are shown to restore proteostasis and functional expression of disease-associated misfolded proteins in cellular models involving both secretory (AAT-Z) and membrane (GABAA receptor) proteins. The findings provide valuable chemical tools for modulating ER proteostasis and may serve as promising leads for therapeutic development targeting protein misfolding diseases.
Strengths:
(1) The study presents a well-defined chemical biology framework integrating proteomics, transcriptomics, and disease-relevant functional assays.
(2) Identification and optimization of a new electrophilic scaffold (AA263) that selectively activates ATF6 represents a valuable advance in UPR-targeted pharmacology.
(3) SAR studies are comprehensive and logically drive the development of more potent and selective analogs such as AA263-20.
(4) Functional rescue is demonstrated in two mechanistically distinct disease models of protein misfolding (one involving a secretory protein and the other a membrane protein), underscoring the translational relevance of the approach.
We thank the reviewer for their positive comments related to our work. We address specific weaknesses highlighted by the reviewer, as outlined below.
Weaknesses:
(1) ATF6 activation is primarily inferred from reporter assays and transcriptional profiling; however, direct evidence of ATF6 cleavage is lacking.
While ATF6 trafficking and processing can be visualized in cell culture models following severe ER insults (e.g., Tg, Tm), we showed previously that the more modest activation afforded by pharmacologic activators such as AA147 and AA263 cannot be easily visualized by monitoring ATF6 processing (see Plate et al (2016) eLife). As we have shown in numerous other manuscripts, we have established a transcriptional profiling approach that accurately defines ATF6 activation. We use that approach to confirm preferential ATF6 activation in this manuscript. We feel that this is sufficient for confirming ATF6 activation. However, we also now include data showing that co-treatment with ATF6 inhibitors (e.g., CP7) blocks the increased expression of ATF6 target genes induced by our prioritized compound AA263<sup>yne</sup> (Fig. S1B). This further supports our assertion that this compound activates ATF6 signaling.
(2) While the mechanism involving PDI modification and ATF6 activation is plausible, it remains incompletely characterized.
We thank the reviewer for this comment. We previously showed that genetic depletion of different PDIs modestly impacts ATF6 activation afforded by ATF6-activating compounds such as AA147. However, as discussed in this manuscript, the ability of AA147 and AA263 to activate ATF6 signaling is mediated through polypharmacologic targeting of multiple different PDIs involved in regulating the ATF6 redox state. Thus, individual knockdowns are predicted to only minimally impact the ability of AA263 and its analogs to activate ATF6 signaling.
To address this comment, we have tempered our language regarding the mechanism of AA263-dependent ATF6 activation through PDI targeting described herein to better reflect the fact that we have not explicitly proven that PDI targeting is responsible for this activity, as highlighted below:
Page 7, Line 158: “Intriguingly, 12 proteins were shared between these two conditions, including 7 different ER-localized PDIs (Fig. 1H). This includes PDIs previously shown to regulate ATF6 activation including TXNDC12/ERP18.[45,46] These results are similar to those observed when comparing proteins modified by the selective ATF6 activating compound AA147<sup>yne</sup> and AA132<sup>yne</sup>.[38] Further, we found that the extent of labeling for PDIs including PDIA1, PDIA4, PDIA6, and TMX1, but not TXNDC12, showed greater modification by AA132<sup>yne</sup>, as compared to AA263<sup>yne</sup> (Fig. 1I). Similar results were observed for AA147<sup>yne</sup>[38] This suggests that, like AA147, the selective activation of ATF6 afforded by AA263 is likely attributed to the modifications of a subset of multiple different ER-localized PDIs by this compound.”
(3) No in vivo data are provided, leaving the pharmacological feasibility and bioavailability of these compounds in physiological systems unaddressed.
We are continuing to test the in vivo activity of these compounds in work outside the scope of this initial study.
Reviewer #1 (Recommendations for the authors):
(1) First page of the discussion, last sentence. "We previously showed the relatively labeling of PDI modification directly impacts..." should be reworded.
Thank you. We have corrected this in the revised manuscript.
(2) What is the rationale for measuring ERSE-Fluc activity at 18 h but RNAseq at 6 h? What is known about the timing of action for AA263?
Compound-dependent activation of luciferase reporters requires the translation and accumulation of the luciferase protein for sufficient signal, while qPCR does not. We normally use longer incubations for reporter assays to ensure that we have sufficient quantity of reporter protein to accurately monitor activation. We have found that AA263 can rapidly increase ATF6 activity, with gene expression increases being observed after only a few hours of treatment. This is consistent with the proposed mechanism of ATF6 activation discussed herein involving metabolic activation and subsequent PDI modification.
(3) Figure 1 panel E and Figure S2 panel B. Are these the same data for AA263 and AA263yne, with the AA263-5 added to the plot for Figure S2? If so, it would be nice to note that panel B represents data from 3 of the replicates that are shown in Figure 1 (n=6).
Yes. The AA263 and AA263<sup>yne</sup> data shown in Fig. 1E and Fig. S2B are the same data, as these experiments were performed at the same time. We apologize for this oversight, which has now been corrected in the revised version. Note that there were n=3 replicates for the dose response shown in Fig. 1E, which we corrected in the figure legend as below:
Fig. S2B Figure Legend: “B. Activation of the ERSE-FLuc ATF6 reporter in HEK293T cells treated for 18 h with the indicated concentration of AA263, AA263<sup>yne</sup>, or AA263-5. Error bars show SEM for n= 3 replicates. The data for AA263 and AA263<sup>yne</sup> is the same as that shown in Fig. 1E and are shown for comparison.”
(4) Figure S3. The legend notes 5 µM AA263-yne and 20 µM analog, whereas the figure itself outlines the same ratio but different concentrations: 10 µM and 40 µM.
We apologize for this mistake in the legend, which has been corrected. The information in the figure is correct.
Reviewer #2 (Recommendations for the authors):
(1) The activation mechanism of ATF6 is still debated (really trafficking as a monomer?); the authors may want to word more carefully here.
We agree. We have corrected this in the revised manuscript to indicate that increased populations of reduced ATF6 traffic for proteolytic processing.
(2) In Figure 1B, below the figure, mM is written for BME, but micromolar is meant.
Thank you. This has been corrected in the revised manuscript.
(3) The authors may want to make clearer, why BME does not completely inhibit AA263 and does not cause ER stress itself under the conditions tested.
The addition of BME in our experiments is designed to shift the redox potential of the cell to increase intracellular thiol reagents, such as glutathione, that can quench ‘activated’ AA263 and its analogs. However, BME is actively being oxidized upon addition and the intracellular redox environment can rapidly equilibrate following BME addition. Thus, we do not expect that AA263 or other metabolically activated compounds will be fully quenched using this approach, as is observed. This is consistent with other experiments where we show that the use of these types of reducing agents does not fully suppress the activity of reactive molecules, instead shifting their dose-dependent activation of specific pathways.
(4) The data in Figure 4C seems to disagree with the other data on the tested compounds; this should be clarified.
It is unclear to what the reviewer is referring. The data in 4C shows that treatment with our optimized AA263 analogs improved elastase inhibition afforded by secreted A1AT, as would be predicted.
(5) PDIs that have been shown to regulate ATF6 should be discussed in more detail in the light of the presented data/interactome (e.g., ERp18).
Thank you for the suggestion. We now explicitly note that AA263<sup>yne</sup> covalently modifies TXNDC12/ERP18 in our proteomic dataset. However, we also note that there is no difference in labeling of this specific PDI between AA263<sup>yne</sup> and AA132<sup>yne</sup>. This may indicate that the targeting of this protein is responsible for the larger levels of ATF6 activation afforded by both these compounds relative to AA147, with the activation of other UPR pathways afforded by AA132 resulting from increased labeling of other PDIs. We are now exploring this possibility in work outside the scope of this current manuscript.
Page 7 Line 158: “Intriguingly, 12 proteins were shared between these two conditions, including 7 different ER-localized PDIs (Fig. 1H). This includes PDIs previously shown to regulate ATF6 activation including TXNDC12/ERP18.[45,46] These results are similar to those observed when comparing proteins modified by the selective ATF6 activating compound AA147<sup>yne</sup> and AA132<sup>yne</sup>.[38] Further, we found that the extent of labeling for PDIs including PDIA1, PDIA4, PDIA6, and TMX1, but not TXNDC12, showed greater modification by AA132<sup>yne</sup>, as compared to AA263<sup>yne</sup> (Fig. 1I). Similar results were observed for AA147<sup>yne</sup> [38] This suggests that, like AA147, the selective activation of ATF6 afforded by AA263 is likely attributed to the modifications of a subset of multiple different ER-localized PDIs by this compound.”
Reviewer #3 (Recommendations for the authors):
(1) Please consider adding detection of ATF6 cleavage by Western blot as direct evidence of AA263-induced ATF6 activation, to substantiate the central mechanistic claim.
While ATF6 trafficking and processing can be visualized in cell culture models following severe ER insults (e.g., Tg, Tm), we showed previously that the more modest activation afforded by pharmacologic activators such as AA147 and AA263 cannot be easily visualized through monitoring ATF6 proteolytic processing by western blotting (see Plate et al (2016) eLife). As we have shown in numerous other manuscripts, we have established a transcriptional profiling approach that accurately defines ATF6 activation. We use that approach to confirm preferential ATF6 activation in this manuscript. We feel that this is sufficient for confirming ATF6 activation. However, we also now include qPCR data showing that co-treatment with ATF6 inhibitors (e.g., CP7) blocks the increased expression of ATF6 target genes induced by our prioritized compounds.
(2) To strengthen causal inference, loss-of-function experiments such as PDI knockdown, cysteine mutant inactivation, or reconstitution studies may be informative.
We thank the reviewer for this comment. We previously showed that genetic depletion of different PDIs modestly impacts ATF6 activation afforded by ATF6-activating compounds such as AA147. However, as discussed in this manuscript, the ability of AA147 and AA263 to activate ATF6 signaling is mediated through polypharmacologic targeting of multiple different PDIs involved in regulating the ATF6 redox state rather than a single PDI family member. Thus, individual knockdowns are predicted to only minimally impact the ability of AA263 and its analogs to activate ATF6 signaling.
To address this comment, we have tempered our language regarding the mechanism of AA263-dependent ATF6 activation through PDI targeting described herein to better reflect the fact that we have not explicitly proven that PDI targeting is responsible for this activity.
(3) Since β-mercaptoethanol inhibits ATF6 activation, it would be helpful to examine whether DTT also suppresses the activity of AA263 or its analogs, to clarify the redox sensitivity of the mechanism.
The use of reducing agents stronger than BME, such as DTT, globally activates the UPR, including the ATF6 arm of the UPR. Thus, we are unable to perform the requested experiments. We specifically use BME because it is a sufficiently mild reducing agent that can quench reactive metabolites (e.g., activated AA263 analogs) through alterations in cellular glutathione levels without globally activating the UPR.
(4) Given the electrophilic nature of AA263, which may allow it to react with endogenous thiols (e.g., glutathione or cysteine), a brief discussion or experimental validation of this potential liability would enhance the interpretation of in vivo applicability.
Metabolically activated AA263, like AA147, can be quenched by endogenous thiols such as glutathione. However, treatment with our metabolically activatable electrophiles AA147 and AA263, either in vitro or in vivo, does not seem to induce activation of the NRF2-regulated oxidative stress response (OSR) in the cell lines used in this manuscript (e.g., Fig. S2C). This suggests that treatment with these compounds does not globally disrupt the intracellular redox state, at least in the tested cell lines. While AA147 has been shown to activate NRF2 in specific neuronal cell lines and in primary neurons, AA147 does not activate NRF2 signaling in other non-neuronal cell lines or other tissues (see Rosarda et al (2021) ACS Chem Bio). We are currently testing the potential for AA263 to similarly activate adaptive NRF2 signaling in neuronal cells. Regardless, AA147, which functions through a similar mechanism to that proposed for AA263, has been shown to be beneficial in multiple models of disease both in vitro and in vivo. This indicates that this mechanism of action is suitable for continued translational development to mitigate pathologic ER proteostasis disruption observed in diverse types of human disease.
(5) Evaluation of in vivo activity, such as BiP induction in the liver following intraperitoneal administration of AA263-20 or related analogs, could substantially increase the translational impact of the work.
We are continuing to probe the activity of our optimized AA263 analogs in vivo in work outside the scope of this current manuscript. We thank the reviewer for this suggestion.
(6) The degree of BiP induction may also be contextualized by comparison with known ER stress inducers such as thapsigargin or tunicamycin, ideally by providing relative dose-equivalent responses.
We are not sure to what the reviewer is referring. We show comparative activation of ATF6 in cells treated with the ER stressor Tg and our compounds by both reporter assay (e.g., Fig. 2B) and qPCR of the ATF6 target gene BiP (HSPA5) (Fig. S2A). We feel that this provides context for the more physiologic levels of ATF6 activation afforded by these compounds.
eLife Assessment
This useful study develops an individual-based model to investigate the evolution of division of labor in vertebrates, comparing the contributions of group augmentation and kin selection. The model incorporates several biologically relevant features, including age-dependent task switching and separate manipulation of relatedness and group-size benefits. However, the evidence remains incomplete to support the authors' central claim that group augmentation is the primary driver of vertebrate division of labor. Key modelling assumptions, such as limited opportunities for task synergy, the structure of helper and floater dynamics, and the relatively narrow parameter space explored, continue to restrict the potential for kin selection to produce division of labor, thereby limiting the generality of the conclusions.
Reviewer #2 (Public review):
Summary:
This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. The model considers a population subdivided in groups, each group has a single asexually-reproducing breeder, other group members (subordinates) can perform two types of tasks called "work" or "defense", individuals have different ages, individuals can disperse between groups, each individual has a dominance rank that increases with age, and upon death of the breeder a new breeder is chosen among group members depending on their dominance. "Workers" pay a reproduction cost by having their dominance decreased, and "defenders" pay a survival cost. Every group member receives a survival benefit with increasing group size. There are 6 genetic traits, each controlled by a single locus, that control propensities to help and disperse, and how task choice and dispersal relate to dominance. To study the effect of group augmentation without kin selection, the authors cross-foster individuals to eliminate relatedness. The paper allows for the evolution of the 6 genetic traits under some different parameter values to study the conditions under which division of labour evolves, defined as the occurrence of different subordinates performing "work" and "defense" tasks. The authors envision the model as one of vertebrate division of labor.
The main conclusion of the paper is that group augmentation is the primary factor causing the evolution of vertebrate division of labor, rather than kin selection. This conclusion is drawn because, for the parameter values considered, when the benefit of group augmentation is set to zero, no division of labor evolves and all subordinates perform "work" tasks but no "defense" tasks.
Strengths:
The model incorporates various biologically realistic details, including the possibility to evolve age polyethism where individuals switch from "work" to "defence" tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.
Weaknesses:
The model and its analysis are limited, which in my view makes the results insufficient to reach the main conclusion that group augmentation and not kin selection is the primary cause of the evolution of vertebrate division of labour. There are several reasons.
First, although the main claim is that group augmentation drives the evolution of division of labour in vertebrates, the model is rather conceptual in that it doesn't use quantitative empirical data that applies to all/most vertebrates and vertebrates only. So, I think the approach has a conceptual reach rather than being able to achieve such a conclusion about a real taxon.
Second, I think that the model strongly restricts the possibility that kin selection is relevant. The two tasks considered essentially differ only by whether they are costly for reproduction or survival. "Work" tasks are those costly for reproduction and "defense" tasks are those costly for survival. The two tasks provide the same benefits for reproduction (eqs. 4, 5) and survival (through group augmentation, eq. 3.1). So, whether one, the other, or both helper types evolve presumably only depends on which task is less costly, not really on which benefits it provides. As the two tasks give the same benefits, there is no possibility that the two tasks act synergistically, where performing one task increases a benefit (e.g., increasing someone's survival) that is going to be compounded by someone else performing the other task (e.g., increasing that someone's reproduction). So, there is very little scope for kin selection to cause the evolution of division of labour in this model. Note that synergy between tasks is not something unusual in division of labour models, but is in fact a basic element in them, so excluding it from the start in the model and then making general claims about division of labour is unwarranted. In their reply, the authors point out that they only consider fertility benefits as this, according to them, is what happens in cooperative breeders with alloparental care; however, alloparental care entails that workers can increase others' survival *without group augmentation*, such as via workers feeding young or defenders reducing predator-caused mortality, as I mentioned in my previous review, but these potentially kin-selected benefits are not allowed here.
Third, the parameter space is understandably little explored. This is necessarily an issue when trying to make general claims from an individual-based model where only a very narrow parameter region of a necessarily particular model can be feasibly explored. As in this model the two tasks ultimately only differ by their costs, the parameter values specifying their costs should be varied to determine their effects. In the main results, the model sets a very low survival cost for work (yh=0.1) and a very high survival cost for defense (xh=3), the latter of which can be compensated by the benefit of group augmentation (xn=3). Some limited variation of xh and xn is explored, always for very high values, effectively making defense unevolvable except if there is group augmentation. In this revision, additional runs have been included varying yh and keeping xh and xn constant (Fig. S6), so without addressing my comment as xn remains very high. Consequently, the main conclusion that "division of labor" needs group augmentation seems essentially enforced by the limited parameter exploration, in addition to the second reason above.
Fourth, my view is that what is called "division of labor" here is an overinterpretation. When the two helper types evolve, what exists in the model is some individuals that do reproduction-costly tasks (so-called "work") and survival-costly tasks (so-called "defense"). However, there are really no two tasks that are being completed, in the sense that completing both tasks (e.g., work and defense) is not necessary to achieve a goal (e.g., reproduction). In this model there is only one task (reproduction, equation 4,5) to which both helper types contribute equally and so one task doesn't need to be completed if completing the other task compensates for it; instead, it seems more fitting to say that there are two types of helpers, one that pays a fertility cost and another one a survival cost, for doing the same task. So, this model does not actually consider division of labor but the evolution of different helper types where both helper types are just as good at doing the single task but perhaps do it differently and so pay different types of costs. In this revision, the authors introduced a modified model where "work" and "defense" must be performed to a similar extent. Although I appreciate their effort, this model modification is rather unnatural and forces the evolution of different helper types if any help is to evolve.
I should end by saying that these comments don't aim to discourage the authors, who have worked hard to put together a worthwhile model and have patiently attended to my reviews. My hope is that these comments can be helpful to build upon what has been done to address the question posed.
Author response:
The following is the authors’ response to the previous reviews
Reviewer #1 (Public review):
This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.
This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. In this round, the authors provided some clarity, but some questions still remain, and I remain unconvinced by a main assumption that was not addressed.
Based on the authors' response, if I understand the life history correctly, dispersers either immediately join another group (with a probability of 1 minus the dispersal probability), or remain floaters until they successfully compete for a breeder spot or die? Is that correct? I honestly cannot decide because this seems implicit in the first response, but the response to my second point raises the possibility that floaters do not work while floating but can work if they later join a group as a subordinate. If it is the case that floaters can have multiple opportunities to join groups as subordinates (not as breeders; I assume that this is the case for breeding competition), this should be stated, along with more details about how. So there is still some clarification to be done, and more to the point, the clarification that happened only happened in the response. The authors should add these details to the main text. Currently, the main text only says vaguely that joining a group after dispersing "is also controlled by the same genetic dispersal predisposition" without saying how.
In each breeding cycle, individuals have the opportunity to become a breeder, a helper, or a floater. Social role is really just a state, and that state can change in each breeding cycle (see Figure 1). Therefore, floaters may join a group as subordinates at any point in time depending on their dispersal propensity, and subordinates may also disperse from their natal group at any given time. In the “Dominance-dependent dispersal propensities” section in the SI, this dispersal or philopatric tendency varies with dominance rank.
We have added: “In each breeding cycle” (L415) to clarify this further.
In response to my query about the reasonableness of the assumption that floaters are in better condition (in the KS treatment) because they don't do any work, the authors have done some additional modeling but I fail to see how that addresses my point. The additional simulations do not touch the feature I was commenting on, and arguably make it stronger (since assuming a positive beta_r, which btw is listed as 0 in Table 1, would make floaters on average even stronger than subordinates). It also again confuses me with regard to the previous point, since it implies that now dispersal is also potentially a lifetime event. Is that true?
We are not quite sure where the reviewer gets this idea because we have never assumed a competitive advantage of floaters versus helpers. As stated in the previous revision, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5 in Figure 1) if subordinates are engaged in work tasks. However, floaters also have higher mortality rates than group members, which lowers their average age. In addition, helpers have the advantage of always competing for an open breeding position in the group, while floaters do not have this preferential access (in Figure S2 we reduce even further the likelihood that a floater tries to compete for a breeding position).
Moreover, in the previous revision (section: “Dominance-dependent dispersal propensities” in the SI) we specifically addressed this concern by adding the possibility that individuals, either floaters or subordinate group members, react to their rank or dominance value to decide whether to disperse (if subordinate) or join a group (if floater). Hence, individuals may choose to disperse when low ranked and then remain on the territory they dispersed to as helpers, OR they may remain as helpers in their natal territory as low ranked individuals and then disperse later when they attain a higher dominance value. The new implementation, therefore, allows individuals to choose when to become floaters or helpers depending on their dominance value. This change to the model affects the relative competitiveness between floaters and helpers, which avoids the assumption that either low- or high-quality individuals are the dispersing phenotype and, instead, allows rank-based dispersal as an emergent trait. As shown in Figure S5, this change had no qualitative impact on the results.
To make this all clearer, we have now added to all of the relevant SI tables a new row with the relative rank of helpers vs floaters. As shown, floaters do not consistently outrank helpers. Rather, which role is most dominant depends on the environment and fitness trade-offs that shape their dispersing and helping decisions.
Some further clarifications: beta_r is a gene that may evolve to either positive or negative values depending on evolutionary trade-offs; 0 (no reaction norm of dispersal to dominance rank) is simply the initial value in the simulations, before evolution takes place. Also, as clarified in the previous comment, the decision to disperse or not occurs at each breeding cycle, so becoming a floater, for example, is not a lifetime event unless individuals evolve a fixed strategy (dispersal = 0 or 1).
Meanwhile, the simplest and most convincing robustness check, which I had suggested last round, is not done: simply reduce the increase in the R of the floater by age relative to subordinates. I suspect this will actually change the results. It seems fairly transparent to me that an average floater in the KS scenario will have R about 15-20% higher than the subordinates (given no defense evolves, y_h=0.1 and H_work evolves to be around 5, and the average lifespans of both floaters and subordinates are roughly in the range of 3.7-2.5, depending on m). That could be a substantial advantage in competition for breeding spots, depending on how that scramble competition actually works. I asked about this function in the last round (how non-linear is it?) but the authors seem to have neglected to answer.
As we mentioned in the previous comment above, we have now added the relative rank between helpers and floaters to all the relevant SI tables, to provide a better idea of the relative competitiveness of residents versus dispersers for each parameter combination. As seen in Table S1, floaters have only a marginal competitive advantage in the “Only kin selection” implementation. This advantage only becomes more pronounced when individuals can choose whether to disperse or remain philopatric depending on their rank. In this case, the difference in rank between helpers and floaters is driven by the high levels of dispersal, with only a few newborns (low rank) remaining briefly in the natal territory (Table S6). Instead, the high dispersal rates observed under the “Only kin selection” scenario appear to result from the low incentives to remain in the group when direct fitness benefits are absent, unless indirect fitness benefits are substantially increased. This effect is reinforced by the need for task partitioning to occur in an all-or-nothing manner (see the new implementation added to the “Kin selection and the evolution of division of labor” in the Supplementary materials; more details in following comments).
In addition, we specifically chose not to impose the constraint of forcing floaters to be lower rank than helpers because doing so would require strong assumptions about how the floaters' rank is determined. These assumptions are unlikely to be universally valid across natural populations (and are probably not commonly met in most species) and could vary considerably among species. Therefore, it would add complexity to the model while reducing generalizability.
As stated in the previous revision, no scramble competition takes place. This was an implementation, not included in the final version of the manuscript, in which age did not influence dominance. Results were equivalent, and we decided to remove it for simplicity prior to the original submission, as the model is already very complex at its current stage; we simply forgot to remove it from Table 1, as we explained in the previous round of revisions.
More generally, I find the assumption (and it is an assumption) that floaters are better off than subordinates in a territory to be still questionable. There is no attempt to justify this with any data, and any data I can find points the other way (though typically they compare breeders and floaters, e.g.: https://bioone.org/journals/ardeola/volume-63/issue-1/arla.63.1.2016.rp3/The-Unknown-Life-of-Floaters--The-Hidden-Face-of/10.13157/arla.63.1.2016.rp3.full concludes "the current preliminary consensus is that floaters are 'making the best of a bad job'."). I think if the authors really want to assume that floaters have higher dominance than subordinates, they should justify it. This is driving at least one and possibly most of the key results, since it affects the reproductive value of subordinates (and therefore the costs of helping).
We explicitly addressed this in the previous revision in a long response about resource holding potential (RHP). Once again, we do NOT assume that dispersers are at a competitive advantage to anyone else. Floaters lack access to a territory unless they either disperse into an established group or colonize an unoccupied territory. Therefore, floaters endure higher mortalities due to the lack of access to territories and group living benefits in the model, and are not always able to try to compete for a breeding position.
The literature reports mixed evidence regarding the quality of dispersing individuals, with some studies identifying them as low-quality and others as high-quality, attributing this to them experiencing fewer constraints when dispersing than their counterparts (e.g. Stiver et al. 2007 Molecular Ecology; Torrents‐Ticó et al. 2018 Journal of Zoology). Additionally, dispersal can provide end-of-queue individuals in their natal group an opportunity to join a queue elsewhere that offers better prospects, outcompeting current group members (Nelson‐Flower et al. 2018 Journal of Animal Ecology). Moreover, in our model floaters do not consistently have lower dominance values or ranks than helpers, and dominance value is often only marginally different.
In short, we previously addressed the concern regarding the relative competitiveness of floaters compared to subordinate group members. To further clarify this point here, we have now included additional data on relative rank in all of the relevant SI tables. We hope that these additions will help alleviate any remaining concerns on this matter.
Regarding division of labor, I think I was not clear so will try again. The authors assume that the group reproduction is 1+H_total/(1+H_total), where H_total is the sum of all the defense and work help, but with the proviso that if one of the totals is higher than "H_max", the average of the two totals (plus k_m, but that's set to a low value, so we can ignore it), it is replaced by that. That means, for example, if total "work" help is 10 and "defense" help is 0, total help is given by 5 (well, 5.1 but will ignore k_m). That's what I meant by "marginal benefit of help is only reduced by a half" last round, since in this scenario, adding 1 to work help would make total help go to 5.5 vs. adding 1 to defense help which would make it go to 6. That is a pretty weak form of modeling "both types of tasks are necessary to successfully produce offspring" as the newly added passage says (which I agree with), since if you were getting no defense but a lot of food, adding more food should plausibly have no effect on your production whatsoever (not just half of adding a little defense). This probably explains why often the "division of labor" condition isn't that different than the no DoL condition.
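The capping rule described in this comment can be sketched numerically. The snippet below is only an illustration under one possible reading of that rule, namely that each task total is capped separately at H_max, taken as the mean of the two totals plus k_m. The function and parameter names are hypothetical and are not taken from the authors' code, and the exact values depend on the reading assumed here.

```python
def total_help(h_work: float, h_defense: float, k_m: float = 0.1) -> float:
    """Total help when each task total is capped at H_max (assumed reading)."""
    # Assumption: H_max is the mean of the two task totals plus k_m.
    h_max = (h_work + h_defense) / 2 + k_m
    # Assumption: a task total exceeding H_max is replaced by H_max.
    return min(h_work, h_max) + min(h_defense, h_max)


def group_reproduction(h_total: float) -> float:
    """Saturating productivity as quoted in the comment: 1 + H / (1 + H)."""
    return 1 + h_total / (1 + h_total)


if __name__ == "__main__":
    # Compare the baseline with one extra unit of work or of defense (k_m ignored).
    for work, defense in [(10, 0), (11, 0), (10, 1)]:
        h = total_help(work, defense, k_m=0.0)
        print(f"work={work} defense={defense} "
              f"H_total={h} reproduction={group_reproduction(h):.3f}")
```

Under this reading, an extra unit of work still raises total help when defense is zero, only by less than an extra unit of defense would, which is the point the comment makes about the penalty being weak.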
The model incorporates division of labor as the optimal strategy for maximizing breeder productivity, while penalizing helping efforts that are limited to either work or defense alone. Because the model does not intend to force the evolution of help as an obligatory trait (breeders may still reproduce in the absence of help; k<sub>0</sub> ≠ 0), we assume that the performance of both types of task by the helpers is a non-obligatory trait that complements parental care.
That said, we recognize the reviewer’s concern that the selective forces modeled for division of labor might not be sufficient in the current simulations. To address this, we have now introduced a new implementation, as discussed in the “Kin selection and the evolution of division of labor” section in the SI. In this implementation, division of labor becomes obligatory for breeders to gain a productivity boost from the help of subordinate group members. The new implementation tests whether division of labor can arise solely from kin selection benefits. Under these premises, philopatry and division of labor do emerge through kin selection, but only when there is a tenfold increase in productivity per unit of help compared to the default implementation. Thus, even if such increases are biologically plausible, they are more likely to reflect the magnitudes characteristic of eusocial insects rather than of cooperatively breeding vertebrates (the primary focus of this model). Such extreme requirements for productivity gains and need for coordination further suggest that group augmentation, and not kin selection, is probably the primary driving force particularly in harsh environments. This is now discussed in L210-213.
Reviewer #2 (Public review):
Summary:
This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. The model considers a population subdivided in groups, each group has a single asexually-reproducing breeder, other group members (subordinates) can perform two types of tasks called "work" or "defense", individuals have different ages, individuals can disperse between groups, each individual has a dominance rank that increases with age, and upon death of the breeder a new breeder is chosen among group members depending on their dominance. "Workers" pay a reproduction cost by having their dominance decreased, and "defenders" pay a survival cost. Every group member receives a survival benefit with increasing group size. There are 6 genetic traits, each controlled by a single locus, that control propensities to help and disperse, and how task choice and dispersal relate to dominance. To study the effect of group augmentation without kin selection, the authors cross-foster individuals to eliminate relatedness. The paper allows for the evolution of the 6 genetic traits under some different parameter values to study the conditions under which division of labour evolves, defined as the occurrence of different subordinates performing "work" and "defense" tasks. The authors envision the model as one of vertebrate division of labor.
The main conclusion of the paper is that group augmentation is the primary factor causing the evolution of vertebrate division of labor, rather than kin selection. This conclusion is drawn because, for the parameter values considered, when the benefit of group augmentation is set to zero, no division of labor evolves and all subordinates perform "work" tasks but no "defense" tasks.
Strengths:
The model incorporates various biologically realistic details, including the possibility to evolve age polyethism where individuals switch from "work" to "defence" tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.
Weaknesses:
The model and its analysis are limited, which makes the results insufficient to reach the main conclusion that group augmentation and not kin selection is the primary cause of the evolution of vertebrate division of labor. There are several reasons.
First, the model strongly restricts the possibility that kin selection is relevant. The two tasks considered essentially differ only by whether they are costly for reproduction or survival. "Work" tasks are those costly for reproduction and "defense" tasks are those costly for survival. The two tasks provide the same benefits for reproduction (eqs. 4, 5) and survival (through group augmentation, eq. 3.1). So, whether one, the other, or both tasks evolve presumably only depends on which task is less costly, not really on which benefits it provides. As the two tasks give the same benefits, there is no possibility that the two tasks act synergistically, where performing one task increases a benefit (e.g., increasing someone's survival) that is going to be compounded by someone else performing the other task (e.g., increasing that someone's reproduction). So, there is very little scope for kin selection to cause the evolution of labour in this model. Note synergy between tasks is not something unusual in division of labour models, but is in fact a basic element in them, so excluding it from the start in the model and then making general claims about division of labour is unwarranted. I made this same point in my first review, although phrased differently, but it was left unaddressed.
The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers, in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care), as we stated in the previous review. Therefore, in this context, helpers may only obtain fitness benefits directly or indirectly by increasing the productivity of the breeders. This benefit is maximized when division of labor occurs between group members as there is a higher return for the least amount of effort per capita. Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. This is not to suggest that the model does not favor synergy, as engaging in two distinct tasks enhances the breeders' productivity more than if group members were to perform only one type of alloparental care task. We have expanded on the need for division of labor by making the performance of each type of task a requirement to boost the breeders' productivity (see more details in a following comment).
Second, the parameter space is very little explored. This is generally an issue when trying to make general claims from an individual-based model where only a very narrow parameter region has been explored of a necessarily particular model. However, in this paper, the issue is more evident. As in this model the two tasks ultimately only differ by their costs, the parameter values specifying their costs should be varied to determine their effects. Instead, the model sets a very low survival cost for work (yh=0.1) and a very high survival cost for defense (xh=3), the latter of which can be compensated by the benefit of group augmentation (xn=3). Some very limited variation of xh and xn is explored, always for very high values, effectively making defense unevolvable except if there is group augmentation. Hence, as I stated in my previous review, a more extensive parameter exploration addressing this should be included, but this has not been done. Consequently, the main conclusion that "division of labor" needs group augmentation is essentially enforced by the limited parameter exploration, in addition to the first reason above.
We systematically explored the parameter landscape and report in the body of the paper only those ranges that lead to changes in the reaction norms of interest (other ranges are explored in the SI). When looking into the relative magnitude of the costs of work and defense tasks, it is important to note that cost values are not directly comparable because they affect different traits. However, the ranges of values capture changes in the reaction norms that lead to rank-dependent task specialization.
To illustrate this more clearly, we have added a new section in the SI (Variation in the cost of work tasks instead of defense tasks section) showing variation in y<sub>h</sub>, which highlights how individuals trade off the relative costs of different tasks. As shown, the results remain consistent with everything we showed previously: a higher cost of work (high y<sub>h</sub>) shifts investment toward defense tasks, while a higher cost of defense (high x<sub>h</sub>) shifts investment toward work tasks.
Importantly, additional parameter values were already included in the SI of the previous revision, specifically to favor the evolution of division of labor under only kin selection. Basically, division of labor under only kin selection does happen, but only under conditions that are very restrictive, as discussed in the “Kin selection and the evolution of division of labor” section in the SI. We have tried to make this point clearer now (see comments to previous reviewer above, and to this reviewer right below).
Third, what is called "division of labor" here is an overinterpretation. When the two tasks evolve, what exists in the model is some individuals that do reproduction-costly tasks (so-called "work") and survival-costly tasks (so-called "defense"). However, there are really no two tasks that are being completed, in the sense that completing both tasks (e.g., work and defense) is not necessary to achieve a goal (e.g., reproduction). In this model there is only one task (reproduction, equations 4, 5) to which both "tasks" contribute equally and so one task doesn't need to be completed if the other task compensates for it. So, this model does not actually consider division of labor.
Although it is true that we did not make the evolution of help obligatory and, therefore, did not impose division of labor by definition, the assumptions of the model nonetheless create conditions that favor the emergence of division of labor. This is evident when comparing the equilibria between scenarios where division of labor was favored versus not favored (Figure 2 triangles vs circles).
That said, we acknowledge the reviewer’s concern that the selective forces modeled in our simulations may not, on their own, be sufficient to drive the evolution of division of labor under only kin selection. Therefore, we have now added a section where we restrict the evolution of help to instances in which division of labor is necessary to have an impact on the dominant breeder's productivity. Under this scenario, we do find division of labor (as well as philopatry) evolving under only kin selection. However, this behavior only evolves when help greatly increases the breeders’ productivity (by a factor of 10 relative to what is needed for the evolution of division of labor under group augmentation). Therefore, group augmentation still appears to be the primary driver of division of labor, while kin selection facilitates it and may, under certain restrictive circumstances, also promote division of labor independently (discussed in L210-213).
Reviewer #1 (Recommendations for the authors):
I really think you should do the simulations where floaters do not come out ahead by floating. That will likely change the result, but if it doesn't, you will have a more robust finding. If it does, then you will have understood the problem better.
As we outlined in the previous round of revisions, implementing this change would be challenging without substantially increasing model complexity and reducing its general applicability, as it would require strong assumptions that could heavily influence dispersal decisions. For instance, by how much should helpers outcompete floaters? Would a floater be less competitive than a helper regardless of age, or only if age is equal? If competitiveness depends on equal age, what is the impact of performing work tasks given that workers always outcompete immigrants? Conversely, if floaters are less competitive regardless of age, is it realistic that a young individual would outcompete all immigrants? If a disperser finds a group immediately after dispersal versus floating for a while, is the dominance value reduced less (as would happen to individuals prospecting before dispersal)?
Clearly it is not as simple as the referee suggests because there are many scenarios that would need to be considered and many assumptions made in doing this. As we explained in response to the points above, we think our treatment of floaters is consistent with the definition of floaters in the literature, and our model takes a general approach without making too many assumptions.
Reviewer #2 (Recommendations for the authors):
The paper's presentation is still unclear. A few instances include the following. It is unclear what is plotted in the vertical axes of Figure 2, which is T but T is a function of age t, so this T is presumably being plotted at a specific t but which one is not stated.
The values graphed are the averages of the phenotypically expressed tasks, not the reaction norms per se. We have now rewritten the axis to “Expressed task allocation T (0 = work, 1 = defense)” to increase clarity across the manuscript.
The section titled "The need for division of labor" in the methods is still very unclear.
We have rephrased this whole section to improve clarity.
eLife Assessment
The authors identify the Bearded-type small protein E(spl)m4 as a physical and genetic interactor of TRAF4 in the Drosophila wing disc. These valuable findings with potential biomedical relevance are, however, supported by incomplete evidence based largely on overexpression studies that lack quantification, limited molecular support for their model, and issues with Bearded family protein specificity. The work could be of interest to researchers in the fields of cell signaling and developmental biology.
Reviewer #1 (Public review):
Summary:
The authors investigate how the Drosophila TNF receptor-associated factor Traf4 - a multifunctional adaptor protein with potential E3 ubiquitin ligase activity - regulates JNK signaling and adherens junctions (AJs) in wing disc epithelium. When they overexpress Traf4 in the posterior compartment of the wing disc, many posterior cells express the JNK target gene puckered (puc), apoptose, and are basally extruded from the epithelium. The authors term this process "delamination", but I think that this is an inaccurate description, especially since they can suppress the "delamination" by blocking programmed cell death (by concomitantly overexpressing p35). Through Y2H assays using Traf4 as a bait, they identified the Bearded family proteins E(spl)m4 (and to a lesser extent E(spl)m2), as Traf4 interactors. They use AlphaFold to model computationally the interaction between Traf4 and E(spl)m4. They show that co-overexpression of Traf4 with E(spl)m4 in the posterior domain of the wing disc reduces death of posterior cells. They generate a new, weaker hypomorphic allele of Traf4 that is viable (as opposed to the homozygous lethality of null Traf4 alleles). There is some effect of these mutations on wing margin bristles; fewer wing margin bristle defects are seen when E(spl)m4 is overexpressed, suggesting opposite effects of Traf4 and E(spl)m4. Finally, they use the Minute model of cell competition to show that Rp/+ loser clones have greater clone area (indicating increased survival) when they are depleted for Traf4 or when they overexpress E(spl)m4. Only the cell competition results are quantified. Because most of the data in the preprint are not quantified, it is impossible to know how penetrant the phenotypes are. The authors conclude that E(spl)m4 binds the Traf4 MATH/TRAF domain, disrupts Traf4 trimerization, and selectively suppresses Traf4-mediated JNK and caspase activation without affecting its role in AJ destabilization.
However, I believe that this is an overstatement. First, there is no biochemical evidence showing that Traf4 binds E(spl)m4 and that E(spl)m4 disrupts Traf4 trimerization. Second, the data on AJs is weak and not quantified; additionally, cells that are being basally extruded lose contact with neighboring cells, hence changes in adhesion proteins. Related to this, the authors, in my opinion, inaccurately describe basal extrusion of dying cells from the wing disc epithelium as delamination.
Strengths:
(1) The authors use multiple approaches to test the model that overexpressed E(spl)m4 inhibits Traf4, including genetics, cell biological imaging, yeast two-hybrid assays, and molecular modeling.
(2) The authors generate a new Traf4 hypomorphic mutant and use this mutant in cell competition studies, which supports the concept that E(spl)m4 (when overexpressed) can antagonize Traf4.
Weaknesses:
(1) Conflation of "delamination" with "basal extrusion of apoptotic cells": Over-expression of Traf4 causes apoptosis in wing disc cells, and this is a distinct process from delamination of viable cells from an epithelium. However, the two processes are conflated by the authors, and this weakens the premise of the paper.
(2) Dependence on overexpression: The conclusions rely heavily on ectopic expression of Traf4 and E(spl)m4. Thus, the physiological relevance of the interaction remains inferred rather than demonstrated.
(3) Lack of quantitative rigor: Except for the cell competition studies, phenotypic descriptions (e.g., number of apoptotic cells, puc-LacZ intensity) are qualitative; additional quantification, inclusion of sample size, and statistical testing would strengthen the conclusions.
(4) Limited biochemical validation: The Traf4-E(spl)m4 binding is inferred from Y2H and in silico models, but no co-immunoprecipitation or in vitro binding assays confirm direct interaction or the predicted disruption of trimerization.
(5) Specificity within the Bearded family: While E(spl)m2 shows partial binding and Tom shows none, the mechanistic basis for this selectivity is not deeply explored experimentally, leaving questions about motif-context contributions unresolved.
Reviewer #2 (Public review):
Summary:
This manuscript analyzes the contribution of Traf4 to the fate of epithelial cells in the developing wing imaginal disc tissue. The manuscript is direct and concise and suggests an interesting and valuable hypothesis with dual functions of Traf4 in JNK pathway activation and cell delamination. However, the text is partially speculative, and the evidence is incomplete as the main claims are only partially supported. Some results require validation to support the conclusions.
Strengths:
(1) The manuscript is direct and concise, with a well-written and precise introduction.
(2) It presents an interesting and valuable hypothesis regarding the dual role of Traf4 in JNK pathway activation and cell delamination.
(3) The study addresses a relevant biological question in epithelial tissue development using a genetically tractable model.
(4) The use of newly generated Traf4 mutants adds novelty to the experimental approach.
(5) The manuscript includes multiple experimental strategies, such as genetic manipulation and imaging, to explore Traf4 function.
Weaknesses:
(1) The evidence supporting key claims is incomplete, and some conclusions are speculative.
(2) The use of GFP-tagged Traf4 lacks validation regarding its functional integrity.
(3) Orthogonal views and additional imaging data are needed to confirm changes in apicobasal localization and cell delamination.
(4) Experimental conditions and additional methods should be further detailed.
(5) The interaction between Traf4 and E(spl)m4 remains speculative in Drosophila.
(6) New mutants require deeper analysis and validation.
(7) The elimination of Traf4 mutant clones may be due to cell competition, which requires further experimental clarification.
(8) The role of Traf4 in cell competition is contradictory and needs to be resolved.
Reviewer #3 (Public review):
Summary:
This is an important and well-conceived study that identifies the Bearded-type small protein E(spl)m4 as a physical and genetic interactor of TRAF4 in Drosophila. By combining classical genetics, yeast two-hybrid assays, and AlphaFold in silico modeling, the authors convincingly demonstrate that E(spl)m4 acts as an inhibitor of TRAF4-mediated induction of JNK-driven apoptosis in developing larval imaginal wing discs, while not affecting TRAF4's role in adherens junction remodeling.
Based primarily on modeling, the authors propose that the specificity of E(spl)m4 towards TRAF4-mediated signaling arises from its interference with TRAF4 trimerization, which is likely required for the activation of the JNK signaling arm but not for the maintenance of adherens junctions and stability of the E-cadherin/β-catenin complex.
Overall, this study is of broad interest to cell and developmental biologists. It also holds potential biomedical relevance, particularly for strategies aimed at modulating TRAF protein activities to dissect and modulate canonical versus non-canonical signaling functions.
Strengths:
(1) The work identifies the Bearded-type small protein E(spl)m4 as a physical and genetic interactor of TRAF4 in Drosophila, extending the understanding of E(spl)m4 beyond its established functions in Notch signaling.
(2) The study is experimentally solid, well-executed, and written, combining classical genetics with protein-protein interaction assays and modeling to reveal E(spl)m4 as a new regulator of TRAF4 signaling.
(3) The genetic and biochemical data convincingly show the ability of E(spl)m4 overexpression to inhibit TRAF4-induced JNK-dependent apoptosis, while leaving the TRAF4 role in adherens junction remodeling unaffected.
(4) The findings have important implications for the regulation of cell signaling and apoptosis and may guide pharmacological targeting of TRAF proteins.
Weaknesses:
The study is overall strong; however, several aspects could be clarified or expanded to strengthen the proposed mechanism and data presentation:
(1) The proposed mechanism that E(spl)m4 inhibits TRAF4 activation of JNK signaling by affecting TRAF4 trimerization relies mainly on modeling. Experimental evidence would strengthen this claim. For example, a native or non-denaturing SDS-PAGE could be used to assess TRAF4 oligomerization states in the absence or presence of E(spl)m4 overexpression, testing whether E(spl)m4 interferes with high-molecular-weight TRAF4 assemblies.
(2) The study depends largely on E(spl)m4 overexpression, which may not reflect physiological conditions. It would be valuable to test, or at least discuss, whether loss-of-function or knockdown of E(spl)m4 modulates the strength or duration of JNK-mediated signaling, potentially accelerating apoptosis. Such data would reinforce the model that E(spl)m4 acts as a physiological modulator of TRAF4-JNK signaling in vivo.
(3) The authors initially identify both E(spl)m4 and E(spl)m2 as TRAF4 interactors, but subsequently focus on E(spl)m4. It would be helpful to clarify or discuss the rationale for prioritizing E(spl)m4 for detailed functional analysis.
(4) E(spl)m4 overexpression appears to protect RpS3 loser clones (Figure 6H-K), yet caspase-3-positive cells are still visible in mosaic wing discs. Please comment on the nature of these Caspase 3-positive cells, whether they are cell-autonomous to the clone or non-autonomous (Figure 6K)?
(5) This is a clear, well-executed, and conceptually strong study that significantly advances understanding of TRAF4 signaling specificity and its modulation by the Bearded-type protein E(spl)m4.
eLife Assessment
This important study applies an innovative multi-model strategy to implicate the ribosomal protein (RP) encoding genes as candidates causing Hypoplastic Left Heart Syndrome. The evidence from the screen in stem cell-derived cardiomyocytes and whole genome sequencing of human patients, followed by functional analyses of RP genes in fly and fish models, is convincing and supports the authors' claims. This work and methodology applied would be of broad interest to medical biologists working on congenital heart diseases.
Reviewer #1 (Public review):
Nielsen et al have identified a new disease mechanism underlying hypoplastic left heart syndrome due to variants in ribosomal protein genes that lead to impaired cardiomyocyte proliferation. This detailed study starts with an elegant screen in stem cell derived cardiomyocytes and whole genome sequencing of human patients and extends to careful functional analysis of RP gene variants in fly and fish models. Striking phenotypic rescue is seen by modulating known regulators of proliferation including the p53 and Hippo pathways. Additional experiments suggest that cell type specificity of the variants in these ubiquitously expressed genes may result from genetic interactions with cardiac transcription factors. This work positions RPs as important regulators of cardiomyocyte proliferation and differentiation involved in the etiology of HLHS, and points to potential downstream mechanisms.
The revised manuscript has been extended, facilitating interpretation and reinforcing the authors' conclusions.
Reviewer #2 (Public review):
Tanja Nielsen et al. present a novel strategy for identification of candidate genes in Congenital Heart Disease (CHD). Their methodology, which is based on comprehensive experiments across cell models, Drosophila and zebrafish models, represents an innovative, refreshing and very useful set of tools for identification of disease genes, in a field which is struggling with exactly this problem.
The authors have applied their methodology to investigate the pathomechanisms of Hypoplastic Left Heart Syndrome (HLHS) - a severe and rare subphenotype in the large spectrum of CHD malformations. Their data convincingly implicates ribosomal proteins (RPs) in growth and proliferation defects of cardiomyocytes, a mechanism which is suspected to be associated with HLHS.
By whole genome sequencing analysis of a small cohort of trios (25 HLHS patients and their parents) the authors investigated a possible association between RP encoding genes and HLHS.
Although the possible association between defective RPs and HLHS needs to be verified, the results suggest a novel disease mechanism in HLHS, which is a potentially substantial advance in our understanding of HLHS and CHD. The conclusions of the paper are based on solid experimental evidence from appropriate high- to medium-throughput models, while additional genetic results from an independent patient cohort are needed to verify an association between RP encoding genes and HLHS in patients.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
Nielsen et al have identified a new disease mechanism underlying hypoplastic left heart syndrome due to variants in ribosomal protein genes that lead to impaired cardiomyocyte proliferation. This detailed study starts with an elegant screen in stem cell-derived cardiomyocytes and whole genome sequencing of human patients and extends to careful functional analysis of RP gene variants in fly and fish models. Striking phenotypic rescue is seen by modulating known regulators of proliferation, including the p53 and Hippo pathways. Additional experiments suggest that the cell type specificity of the variants in these ubiquitously expressed genes may result from genetic interactions with cardiac transcription factors. This work positions RPs as important regulators of cardiomyocyte proliferation and differentiation involved in the etiology of HLHS, although the downstream mechanisms are unclear.
We thank Reviewer 1 for the thoughtful assessment of our manuscript. Our point-by-point responses to the recommendations are provided (Reviewer 1, “Recommendations for the authors”).
Reviewer #2 (Public review):
Tanja Nielsen et al. present a novel strategy for the identification of candidate genes in Congenital Heart Disease (CHD). Their methodology, which is based on comprehensive experiments across cell models, Drosophila and zebrafish models, represents an innovative, refreshing and very useful set of tools for the identification of disease genes, in a field which is struggling with exactly this problem. The authors have applied their methodology to investigate the pathomechanisms of Hypoplastic Left Heart Syndrome (HLHS) - a severe and rare subphenotype in the large spectrum of CHD malformations. Their data convincingly implicates ribosomal proteins (RPs) in growth and proliferation defects of cardiomyocytes, a mechanism which is suspected to be associated with HLHS.
By whole genome sequencing analysis of a small cohort of trios (25 HLHS patients and their parents), the authors investigated a possible association between RP encoding genes and HLHS. Although the possible association between defective RPs and HLHS needs to be verified, the results suggest a novel disease mechanism in HLHS, which is a potentially substantial advance in our understanding of HLHS and CHD. The conclusions of the paper are based on solid experimental evidence from appropriate high- to medium-throughput models, while additional genetic results from an independent patient cohort are needed to verify an association between RP encoding genes and HLHS in patients.
We thank Reviewer 2 for the thoughtful assessment of our manuscript. Our point-by-point responses to the recommendations are provided (Reviewer 2, “Recommendations for the authors”).
Reviewer #1 (Recommendations for the authors):
(1) Despite an interesting surveillance model, the disease-causing mechanisms directly downstream of the RP variants remain unclear. Can the authors provide any evidence for abnormal ribosomes or defects in translation in cells harboring such variants? The possibility that reduced translation of cardiac transcription factors such as TBX5 and NKX2-5 may contribute to the functional interactions observed should be considered. How do the authors consider that the RP variants are affecting transcript levels as observed in the study?
Our model implies that cell cycle arrest does not require abnormal ribosomes or translational defects but instead relies on the sensing of RP levels or mutations as a fitness-sensing mechanism that activates TP53/CDKN1A-dependent arrest. Supporting this framework, we observed no significant changes in TBX5 or NKX2-5 expression (data not shown), but rather an upregulation of CDKN1A levels upon RP KD.
(2) The authors suggest that a nucleolar stress program is activated in cells harboring RP gene variants. Can they provide additional evidence for this beyond p53 activation?
We added additional data to support nucleolar stress (Suppl. Fig. 6) and text (lines 526-35):
To determine whether cardiac KD of RpS15Aa causes nucleolar stress in the Drosophila heart, we stained larval hearts for Fibrillarin, a marker for nucleoli and nucleolar integrity. We found that RpS15Aa KD causes expansion of nucleolar Fibrillarin staining in cardiomyocytes, which is a hallmark of nucleolar stress (Suppl. Fig. 6A-C). As a control, we also performed cardiac KD of Nopp140, which is known to cause nucleolar stress upon loss-of-function. We found a similar expansion of Fibrillarin staining in larval cardiomyocyte nuclei (Suppl. Fig. 6C,D). This suggests that RpS15Aa KD indeed causes nucleolar stress in the Drosophila heart, which likely contributes to the dramatic heart loss in adults.
Other recommendations:
(3) Concerning the cell type specificity, in the proliferation screen, were similar effects seen on the actinin negative as actinin positive EdU+ cells? It would be helpful to refer to the fibroblast result shown in Supplementary Figure 1C in the results section.
As suggested by reviewer #1, we have added a reference to Supplementary Fig. 1C, D and noted that RP knockdown exerts a non–CM-specific effect on proliferation.
(4) The authors refer to HLHS patients with atrial septal defects and reduced right ventricular ejection fraction. Please clarify the specificity of the new findings to HLHS versus other forms of CHD, as implied in several places in the manuscript, including the abstract.
This study focused on a cohort of 25 HLHS proband-parent trios selected for poor clinical outcome, including restrictive atrial septal defect and reduced right ventricular ejection fraction. We have revised the following sentence in response to the Reviewer’s comment (lines 567-571): “While our study highlights the potential of this approach for gene prioritization, additional research is needed to directly demonstrate the functional consequence of the identified genetic variants, verify an association between RP encoding genes and HLHS in other patient cohorts with and without poor outcome, and determine if RP variants have a broader role in CHD susceptibility.”
(5) The multi-model approach taken by the authors is clearly a good system for characterizing disease-causing variants. Did the authors score for cardiomyocyte proliferation or the time of phenotypic onset in the zebrafish model?
We used an antibody against phosphohistone 3 to identify proliferating cells and DAPI to identify all cardiac cells in control injected, rps15a morphants, and rps15a crispants. We found that cell numbers and proliferating cells were significantly reduced at 24 and 48 hpf. By 72 hpf cardiac cell proliferation is greatly diminished even in controls, where proliferation typically declines.
Reduced ventricular cardiomyocyte numbers could potentially result from impaired addition of LTPB3-expressing progenitors. In experiments where altered cardiac rhythm is observed, please comment on the possible links to proliferation.
Heart function data showed that heart period (R-R interval) was unaffected in morphants and crispants at 72 hpf where we also observed significant reductions in cell numbers. This suggests that the bradycardia observed in the rps15a + nkx2.5 or tbx5a double KD (Sup. Fig. 5D & E) was not due to the reduction in cell numbers alone.
Author response image 1.
Finally, the use of the mouse to model HLHS in potential follow-up studies should be discussed.
We have added a mouse model comment to the discussion (lines 571-74): “In conclusion, we propose that the approach outlined in this study provides a novel framework for rapidly prioritizing candidate genes and systematically testing them, individually or in combination, using a CRISPR/Cas9 genome-editing strategy in mouse embryos (PMID: 28794185)”.
(6) When the authors scored proliferation in cells from the proband in family 75H, did they validate that RPS15A expression is reduced, consistent with a regulatory region defect?
Good point. We examined RPS15A expression in these cells and found no significant reduction in gene expression in day 25 cardiomyocytes (data not shown). One possible explanation is that this variant may regulate RPS15A expression in a stage-specific manner during differentiation or under additional stress conditions.
(7) Minor point. Typo on line 494: comma should be placed after KD, not before.
Thank you, this has now been corrected (new line 490).
Reviewer #2 (Recommendations for the authors):
(1) The authors are invited to revise the part of the manuscript that describes the genetic analysis and provide a more balanced discussion of the WGS data, with a conclusion that aligns with the strength of the human genetic data.
We disagree with reviewer #2’s assessment. The goal of our study is not to apply a classical genetic approach to establish variant pathogenicity, but rather to employ a multidisciplinary framework to prioritize candidate genes and variants and to examine their roles in heart development using model systems. In this context, genetic analysis serves primarily as a filtering tool rather than as a means of definitively establishing causality.
(2) The genetic analysis of patients does not appear to provide strong evidence for an association between RP gene variants and HLHS. More information regarding methodology and the identified variants is needed.
HLHS is widely recognized as an oligogenic and heterogeneous genetic disease in which traditional genetic analyses have consistently failed to prioritize any specific gene class, as Reviewer #2 points out. Therefore, relying solely on genetic analysis is unlikely to yield strong evidence for association with a given gene class. This limitation provides the rationale for our multidisciplinary gene prioritization strategy, which leverages model systems to interrogate candidate gene function. Ultimately, definitive validation of this approach will require studies in relevant in vivo models to establish causality within the context of a four-chambered heart (see also Discussion).
In Table S2, it would be appropriate to provide information on sequence, MAF, and CADD. Please note the source of MAF% (GnomAD version?, which population?).
As summarized in Figure 2A, the 292 genes from the families of the 25 probands with poor outcome, displayed in Supplemental Table 2, fulfilled a comprehensive candidate gene prioritization algorithm based on the variant, gene, inheritance, and enrichment, which required all of the following: 1) variants identified by whole genome sequencing with minor allele frequency <1%; 2) missense, loss-of-function, canonical splice, or promoter variants; 3) upper quartile fetal heart expression; and 4) de novo or recessive inheritance. Unbiased network analysis of these 292 genes, which are displayed in Supplemental Table 2 for completeness, identified statistically significant enrichment of ribosomal proteins. The details about MAF, CADD score, and sequence highlighted by the Reviewer are provided for the RP genes in Table 1, which are central to the focus and findings of the manuscript.
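The four-part prioritization filter described above can be read as a conjunction of predicates applied to each variant. The following Python sketch is illustrative only: the `Variant` fields, the example records, and the boolean encoding of "upper quartile fetal heart expression" are assumptions introduced here and do not reproduce the authors' pipeline.

```python
from dataclasses import dataclass

# Categories taken from the four criteria listed in the response;
# the string labels themselves are hypothetical.
QUALIFYING_CONSEQUENCES = {"missense", "loss_of_function", "canonical_splice", "promoter"}
QUALIFYING_INHERITANCE = {"de_novo", "recessive"}


@dataclass(frozen=True)
class Variant:
    gene: str
    maf: float                        # minor allele frequency as a fraction
    consequence: str
    fetal_heart_upper_quartile: bool  # assumed to be precomputed per gene
    inheritance: str


def passes_filter(v: Variant) -> bool:
    """Return True only if all four criteria are met."""
    return (
        v.maf < 0.01                                   # 1) MAF < 1%
        and v.consequence in QUALIFYING_CONSEQUENCES   # 2) variant type
        and v.fetal_heart_upper_quartile               # 3) expression
        and v.inheritance in QUALIFYING_INHERITANCE    # 4) inheritance
    )


# Invented example records, not data from the study.
variants = [
    Variant("GENE_A", 0.002, "missense", True, "recessive"),
    Variant("GENE_B", 0.050, "missense", True, "de_novo"),     # fails MAF
    Variant("GENE_C", 0.001, "synonymous", True, "de_novo"),   # fails variant type
    Variant("GENE_D", 0.004, "promoter", False, "recessive"),  # fails expression
]

candidate_genes = sorted({v.gene for v in variants if passes_filter(v)})
print(candidate_genes)
```

Because every criterion must hold, a variant is dropped as soon as any single test fails, which is why such a filter shortens the candidate list quickly.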
It would also be helpful for the reader if genome coordinates (e.g., 16-11851493-G-A for RSL1D1 p.A7V) were provided for each variant in both Table 1 and S2.
Genome coordinates have been added to Table 1.
(3) The dataset from the hPSC-CM screen could be of high value for the community. It would be appropriate if the complete dataset were made available in a usable format.
The dataset from the hPSC-CM screen has been added to the manuscript as Suppl. Table 1.
(4) The "rare predicted-damaging promoter variant in RPS15A" (c.-95G>A) does not appear so rare. Considering the MAF of 0.00662, the frequency of heterozygous carriers of this variant is 1 out of 76 individuals in the general population. Thus, considering the frequency of HLHS in the population (2-3 out of 10,000) and the small size of family 75H, the data do not appear to indicate any association between this particular variant and HLHS. The variants in Table 1 also appear to have relatively mild effects on the gene product, judging from the MAF and CADD scores. The authors are invited to discuss why they find these variants disease-causing in HLHS.
Our study design is based on the widely held premise that HLHS is an oligogenic disorder. Our multi-model systems platform centered on comprehensive filtering of coding and regulatory variants identified by whole genome sequencing of HLHS probands to identify candidate genes associated with susceptibility to this rare developmental phenotype. 75H proved to be a high-value family for generating a relatively short list of candidate genes for left-sided CHD. Given the rarity of both left-sided CHD and the RPS15A variant identified in the HLHS proband and his 5th degree relative, with a frequency consistent with a risk allele for an oligogenic disorder, we made the reasonable assumption that this was a bona fide genotype-phenotype association rather than a chance occurrence. Moreover, incomplete penetrance and variable expression are consistent with a genetically complex basis of disease whereby the shared variant is risk-conferring and acts in conjunction with additional genetic, epigenetic, and/or environmental factors that lead to a left-sided CHD phenotype. In sum, we do not claim these variants are definitively disease-causing, but rather potentially contributing risk factors.
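The carrier frequency quoted in point (4) can be reproduced with a short calculation. The sketch below assumes Hardy-Weinberg proportions, an assumption introduced here rather than stated in the review, under which the heterozygote frequency is 2pq for allele frequency p. The function name is illustrative.

```python
def heterozygous_carrier_frequency(maf: float) -> float:
    """Expected heterozygote frequency 2pq under Hardy-Weinberg proportions."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * maf * (1.0 - maf)


maf = 0.00662  # MAF quoted by the reviewer for RPS15A c.-95G>A
carrier_freq = heterozygous_carrier_frequency(maf)
one_in = 1.0 / carrier_freq

print(f"carrier frequency = {carrier_freq:.5f} (about 1 in {one_in:.0f})")
# prints: carrier frequency = 0.01315 (about 1 in 76)
```

On these assumptions roughly 130 in 10,000 individuals would carry the variant, which can be set against the HLHS prevalence of 2-3 in 10,000 cited by the reviewer.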
(5) Information is lacking on how clustering of RP genes was demonstrated using STRING (with P-values that support the conclusions). What is meant by "when the highest stringency filter was applied"? Does this refer to the STRING interaction score or something else? The authors could also explain which genes were used to search STRING (e.g., all 292 candidate genes) and provide information on the STRING interaction score used in the analysis, the number of nodes and edges in the network.
To determine whether certain gene networks were over-represented, two online bioinformatics tools were used. First, genes were entered into STRING (Author response table 2 below) to investigate experimental and predicted protein-protein and genetic interactions. Clustering of ribosomal protein genes was demonstrated when applying the highest stringency filter. Next, genes were analyzed for potential enrichment by ontology classification using PANTHER. Applying Fisher’s exact test and false discovery rate corrections, ribosomal proteins were the most enriched class when compared to the reference proteome, including data annotated by molecular function (4.84-fold, p=0.02), protein class (6.45-fold, p=0.00001), and cellular component (9.50-fold, p=0.001). A majority of the identified RP candidate genes harbored variants that fit a recessive inheritance disease model.
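For readers unfamiliar with over-representation testing, the general idea behind a fold-enrichment value with a Fisher's exact test and a false discovery rate correction can be sketched as follows. This is a generic illustration, not PANTHER's implementation: the one-sided test is computed as a hypergeometric upper tail, the Benjamini-Hochberg procedure is assumed as the FDR method, and all counts are placeholders rather than values from the manuscript.

```python
from math import comb


def fold_enrichment(k: int, n: int, K: int, N: int) -> float:
    """Observed fraction of category genes in the input list over the reference fraction."""
    return (k / n) / (K / N)


def fisher_one_sided(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k): N reference genes, K in the category, n in the input list, k hits."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder counts for illustration only.
N, K, n, k = 1000, 20, 50, 5
p = fisher_one_sided(k, n, K, N)
print(f"fold enrichment = {fold_enrichment(k, n, K, N):.2f}, p = {p:.3g}")
print(benjamini_hochberg([p, 0.20, 0.50]))
```

The correction matters because many ontology categories are tested at once, so a raw p-value for one category overstates the evidence.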
Author response image 2.
eLife Assessment
This valuable work substantially advances our understanding of the prognostic value of total gfDNA in gastric cancer. The evidence supporting the conclusions is solid, supported by a large, well-classified patient cohort and controlled clinical variables. The work will be of broad interest to scientists and clinical pathologists working in the field of gastric cancer.
Reviewer #1 (Public review):
The study analyzes the gastric fluid DNA content identified as a potential biomarker for human gastric cancer. However, the study lacks overall logicality, and several key issues require improvement and clarification. In the opinion of this reviewer, some major revisions are needed:
(1) This manuscript lacks a comparison of gastric cancer patients' stages with PN and N+PD patients, especially T0-T2 patients.
(2) The comparison between gastric cancer stages seems only to reveal the difference between T3 patients and early-stage gastric cancer patients, which raises doubts about the authenticity of the previous differences between gastric cancer patients and normal patients, whether it is only due to the higher number of T3 patients.
(3) The prognosis evaluation is too simplistic, only considering staging factors, without taking into account other factors such as tumor pathology and the time from onset to tumor detection.
(4) The comparison between gfDNA and conventional pathological examination methods should be mentioned, reflecting advantages such as accuracy and patient comfort.
(5) There are many questions in the figures and tables. Please match the Title, Figure legends, Footnote, Alphabetic order, etc.
(6) The overall logicality of the manuscript is not rigorous enough, with few discussion factors, and cannot represent the conclusions drawn.
Comments on revisions:
The authors have addressed all concerns in the revision.
Reviewer #2 (Public review):
Summary
The authors aimed to evaluate whether total DNA concentration in gastric fluid (gfDNA) collected during routine endoscopy could serve as a diagnostic and prognostic biomarker for gastric cancer. Using a large cohort (n=941), they reported elevated gfDNA in gastric cancer patients, an unexpected association with improved survival, and a positive correlation with immune cell infiltration.
Strengths
The study benefits from a substantial sample size, clear patient stratification, and control of key clinical confounders. The method is simple and clinically feasible, with preliminary evidence linking gfDNA to immune infiltration.
Weaknesses
(1) While the study identifies gfDNA as a potential prognostic tool, the evidence remains preliminary. Unexplained survival associations and methodological gaps weaken support for the conclusions.
(2) The paradoxical association between high gfDNA and better survival lacks mechanistic validation. The authors acknowledge but do not experimentally distinguish tumor vs. immune-derived DNA, leaving the biological basis speculative.
(3) Pre-analytical variables were noted but not systematically analyzed for their impact on gfDNA stability.
Comments on revisions:
To enhance the completeness and credibility of this research, it is essential to clarify the biological origin of gastric fluid DNA and validate these preliminary findings through a prospective, longitudinal study design.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
“The study analyzes the gastric fluid DNA content identified as a potential biomarker for human gastric cancer. However, the study lacks overall logicality, and several key issues require improvement and clarification. In the opinion of this reviewer, some major revisions are needed:”
(1) “This manuscript lacks a comparison of gastric cancer patients' stages with PN and N+PD patients, especially T0-T2 patients.”
We are grateful for this astute remark. A comparison of gfDNA concentration among the diagnostic groups indicates a trend of increasing values as the diagnosis progresses toward malignancy. The observed values for the diagnostic groups are as follows:
Author response table 1.
The chart below presents the statistical analyses of the same diagnostic/tumor-stage groups (One-Way ANOVA followed by Tukey’s multiple comparison tests). It shows that gfDNA concentrations gradually increase with malignant progression. We observed that the initial tumor stages (T0 to T2) exhibit intermediate gfDNA levels, which are significantly lower than in advanced disease (p = 0.0036) but not statistically different from non-neoplastic disease (p = 0.74).
Author response image 1.
(2) “The comparison between gastric cancer stages seems only to reveal the difference between T3 patients and early-stage gastric cancer patients, which raises doubts about the authenticity of the previous differences between gastric cancer patients and normal patients, whether it is only due to the higher number of T3 patients.”
We appreciate the attention to detail regarding the numbers analyzed in the manuscript. Importantly, the results are meaningful because the number of subjects in each group is comparable (T0-T2, N = 65; T3, N = 91; T4, N = 63). The mean gfDNA values (ng/µL) increase with disease stage (T0-T2: 15.12; T3-T4: 30.75), and both are higher than the mean gfDNA values observed in non-neoplastic disease (10.81 ng/µL for N+PD and 10.10 ng/µL for PN). These subject numbers in each diagnostic group accurately reflect real-world data from a tertiary cancer center.
(3) “The prognosis evaluation is too simplistic, only considering staging factors, without taking into account other factors such as tumor pathology and the time from onset to tumor detection.”
Histopathological analyses were performed throughout the study not only for the initial diagnosis of tissue biopsies, but also for the classification of Lauren’s subtypes, tumor staging, and the assessment of the presence and extent of immune cell infiltrates. Regarding the time of disease onset, this variable is inherently unknown, by definition, at the time of a diagnostic EGD. While the prognosis definition is indeed straightforward, we believe that a simple, cost-effective, and practical approach is advantageous for patients across diverse clinical settings and is more likely to be effectively integrated into routine EGD practice.
(4) “The comparison between gfDNA and conventional pathological examination methods should be mentioned, reflecting advantages such as accuracy and patient comfort.”
We wish to reinforce that EGD, along with conventional histopathology, remains the gold standard for gastric cancer evaluation. EGD under sedation is routinely performed for diagnosis, and the collection of gastric fluids for gfDNA evaluation does not affect patient comfort. Thus, while gfDNA analysis was evidently not intended as a diagnostic EGD and biopsy replacement, it may provide added prognostic value to this exam.
(5) “There are many questions in the figures and tables. Please match the Title, Figure legends, Footnote, Alphabetic order, etc.”
We are grateful for these comments and apologize for the clerical oversight. All figures, tables, titles and figure legends have now been double-checked.
(6) “The overall logicality of the manuscript is not rigorous enough, with few discussion factors, and cannot represent the conclusions drawn.”
We assume that the remark regarding “overall logicality” pertains to the rationale and/or reasoning of this investigational study. Our working hypothesis was that during neoplastic disease progression, tumor cells continuously proliferate and, depending on various factors, attract immune cell infiltrates. Consequently, both tumor cells and immune cells (as well as tumor-derived DNA) are released into the fluids surrounding the tumor at its various locations, including blood, urine, saliva, gastric fluids, and others. Increases in DNA levels within some of these fluids have accordingly been documented and are clinically meaningful. The concurrent observation of elevated gfDNA levels and immune cell infiltration supports the hypothesis that increased gfDNA, which may originate not only from tumor cells but also from immune cells, could be associated with better prognosis, as suggested by this study of a large real-world patient cohort.
In summary, we thank Reviewer #1 for his time and effort in a constructive critique of our work.
Reviewer #2 (Public review):
Summary:
“The authors investigated whether the total DNA concentration in gastric fluid (gfDNA), collected via routine esophagogastroduodenoscopy (EGD), could serve as a diagnostic and prognostic biomarker for gastric cancer. In a large patient cohort (initial n=1,056; analyzed n=941), they found that gfDNA levels were significantly higher in gastric cancer patients compared to non-cancer, gastritis, and precancerous lesion groups. Unexpectedly, higher gfDNA concentrations were also significantly associated with better survival prognosis and positively correlated with immune cell infiltration. The authors proposed that gfDNA may reflect both tumor burden and immune activity, potentially serving as a cost-effective and convenient liquid biopsy tool to assist in gastric cancer diagnosis, staging, and follow-up.”
Strengths:
“This study is supported by a robust sample size (n=941) with clear patient classification, enabling reliable statistical analysis. It employs a simple, low-threshold method for measuring total gfDNA, making it suitable for large-scale clinical use. Clinical confounders, including age, sex, BMI, gastric fluid pH, and PPI use, were systematically controlled. The findings demonstrate both diagnostic and prognostic value of gfDNA, as its concentration can help distinguish gastric cancer patients and correlates with tumor progression and survival. Additionally, preliminary mechanistic data reveal a significant association between elevated gfDNA levels and increased immune cell infiltration in tumors (p=0.001).”
Reviewer #2 has conceptually grasped the overall rationale of the study quite well, and we are grateful for their assessment and comprehensive summary of our findings.
Weaknesses:
(1) “The study has several notable weaknesses. The association between high gfDNA levels and better survival contradicts conventional expectations and raises concerns about the biological interpretation of the findings.”
We agree that this would be the case if the gfDNA was derived solely from tumor cells. However, the findings presented here suggest that a fraction of this DNA may indeed be derived from infiltrating immune cells. The precise origin of this increased gfDNA remains to be determined in follow-up studies, which are planned and will apply DNA- and RNA-sequencing methodologies and deconvolution analyses.
(2) “The diagnostic performance of gfDNA alone was only moderate, and the study did not explore potential improvements through combination with established biomarkers. Methodological limitations include a lack of control for pre-analytical variables, the absence of longitudinal data, and imbalanced group sizes, which may affect the robustness and generalizability of the results.”
Reviewer #2 is correct that this investigational study was not designed to assess the diagnostic potential of gfDNA. Instead, its primary contribution is to provide useful prognostic information. In this regard, we have not yet explored combining gfDNA with other clinically well-established diagnostic biomarkers. We do acknowledge this current limitation as a logical follow-up that must be investigated in the near future.
Moreover, we collected a substantial number of pre-analytical variables within the limitations of a study involving over 1,000 subjects. Longitudinal samples and data were not analyzed here, as our aim was to evaluate prognostic value at diagnosis. Although the groups are imbalanced, this accurately reflects the real-world population of a large endoscopy center within a dedicated cancer facility. Subjects were invited to participate and enter the study before sedation for the diagnostic EGD procedure; thus, samples were collected prospectively from all consenting individuals.
Finally, to maintain a large, unbiased cohort, we did not attempt to balance the groups, allowing analysis of samples and data from all patients with compatible diagnoses (please see Results: Patient groups and diagnoses).
(3) “Additionally, key methodological details were insufficiently reported, and the ROC analysis lacked comprehensive performance metrics, limiting the study's clinical applicability.”
We are grateful for this useful suggestion. In the current version, each ROC curve (Supplementary Figures 1A and 1B) now includes the top 10 gfDNA thresholds, along with their corresponding sensitivity and specificity values (please see Suppl. Table 1). The thresholds are ordered from best to worst based on the classic Youden’s J statistic, as follows:
Youden Index = specificity + sensitivity – 1 [Youden WJ. Index for rating diagnostic tests. Cancer 3:32-35, 1950. PMID: 15405679]. We have made an effort to provide all the key methodological details requested, but we would be glad to add further information upon specific request.
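The threshold ranking described above can be sketched in a few lines of Python. The threshold, sensitivity, and specificity values are placeholders invented for illustration and are not the values in Suppl. Table 1; the names `Cutoff`, `youden_j`, and `rank_thresholds` are likewise hypothetical.

```python
from typing import NamedTuple


class Cutoff(NamedTuple):
    threshold: float    # gfDNA concentration cut-off (ng/µL)
    sensitivity: float
    specificity: float


def youden_j(c: Cutoff) -> float:
    """Youden's J = sensitivity + specificity - 1."""
    return c.sensitivity + c.specificity - 1.0


def rank_thresholds(cutoffs, top_n=10):
    """Order candidate cut-offs from best to worst by J and keep the top_n."""
    return sorted(cutoffs, key=youden_j, reverse=True)[:top_n]


# Placeholder values only, not data from the study.
candidates = [
    Cutoff(5.0, 0.90, 0.30),
    Cutoff(10.0, 0.70, 0.60),
    Cutoff(20.0, 0.45, 0.80),
]

for c in rank_thresholds(candidates):
    print(f"threshold={c.threshold:5.1f}  J={youden_j(c):.2f}")
```

J ranges from 0 for a test no better than chance to 1 for a perfect test, and it weights sensitivity and specificity equally, which may or may not match the clinical cost of each error type.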
Reviewer #1 (Recommendations for the authors):
The authors should pay attention to ensuring uniformity in the format of all cited references, such as the number of authors for each reference, the journal names, publication years, volume numbers, and page number formats, to the best extent possible.
Thank you for pointing out this inconsistency. All cited references have now been revisited and adjusted properly. We apologize for this clerical oversight.
Reviewer #2 (Recommendations for the authors):
(1) “High gfDNA levels were surprisingly linked to better survival, which conflicts with the conventional understanding of cfDNA as a tumor burden marker. Was any qualitative analysis performed to distinguish DNA derived from immune cells versus tumor cells?”
Tumor-derived DNA is certainly present in gfDNA, as our group has unequivocally demonstrated in a previous publication [Pizzi M. P., et al. (2019) Identification of DNA mutations in gastric washes from gastric adenocarcinoma patients: Possible implications for liquid biopsies and patient follow-up. Int J Cancer 145:1090–1097. DOI: 10.1002/ijc.32114]. However, in the present manuscript, our data suggest that gfDNA may also contain DNA derived from infiltrating immune cells. This may also be the case for other malignancies, and qualitative deconvolution studies could provide more informative data. To achieve this, DNA sequencing and RNA-Seq analyses may offer relevant evidence. Our study should be viewed as an original and preliminary analysis that may encourage such quantitative and qualitative studies in biofluids from cancer patients. Currently, this is a simple approach (which might be its essential beauty), but we hope to investigate this aspect further in future studies.
(2) “The ROC curve AUC was 0.66, indicating only moderate discrimination ability. Did the authors consider combining gfDNA with markers such as CEA or CA19-9 to improve diagnostic accuracy?”
This is indeed a logical idea, which shall certainly be explored in planned follow-up studies.
(3) “DNA concentration could be influenced by non-biological factors, including gastric fluid pH, sampling location, time delay, or freeze-thaw cycles. Were these operational variables assessed for their effect on data stability?”
We appreciate the rigor of the evaluation. Yes, information regarding gastric fluid pH was collected. All samples were collected from the stomach during EGD procedure. Samples were divided in aliquots and were thawed only once. This information is now provided in the updated manuscript text.
(4) “This cross-sectional study lacks data on gfDNA changes over time, limiting conclusions on its utility for monitoring treatment response or predicting recurrence.”
Again, temporal evaluation is another excellent point, and it will be the subject of future analyses. In this exploratory study, samples were collected at diagnosis, at a single time point. We have not obtained serial samples, as participants received appropriate therapy soon after diagnosis.
(5) “The normal endoscopy group included only 10 patients, the precancerous lesion group 99 patients, while the gastritis group had 596 patients. Such uneven sample sizes may affect statistical reliability and generalizability. Has weighted analysis or optimized sampling been considered for future studies?”
Yes, in future studies this analysis will be considered, probably by employing stratified random sampling with relevant patient attributes recorded.
(6) “The SciScore was only 2 points, indicating that key methodological details such as inclusion/exclusion criteria, randomization, sex variables, and power calculation were not clearly described. It is recommended that these basic research elements be supplemented in the Methods section.”
This was exploratory research, the first of its kind, to evaluate the prognostic potential of gfDNA in the context of gastric cancer. Patients were not included if they did not sign the informed consent, and were excluded if they withdrew after consenting. Other exclusion criteria included diagnoses of conditions such as previous gastrectomy or esophagectomy, or the presence of non-gastric malignancies. Randomization and power analyses were not applicable, as no prior data were available regarding gfDNA concentration values or its diagnostic/prognostic potential. All subjects, regardless of sex, were invited to participate without discrimination or selection.
(7) “Although a ROC curve was provided in the supplementary materials (Supplementary Figure 1), only the curve and AUC value were shown without sensitivity, specificity, predictive values, or cutoff thresholds. The authors are advised to provide a full ROC performance assessment to strengthen the study's clinical relevance.”
These data are now given alongside the ROC curves in the Supplementary Information section, specifically in Supplementary Figure 1 and in the newly added Supplementary Table 1.
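As an illustration of what such a full ROC performance assessment involves, the following is a minimal Python sketch and not the analysis code used in the study. It assumes that higher marker values indicate the positive class and that the cutoff is chosen by maximising Youden's J; the function names `metrics_at_cutoff` and `youden_cutoff` are hypothetical, and any values passed to them in an example are toy numbers, not gfDNA data.

```python
def metrics_at_cutoff(values, labels, cutoff):
    """Call 'value >= cutoff' positive (assumed direction) and return
    sensitivity, specificity, PPV and NPV for binary labels (1 = case)."""
    tp = fp = fn = tn = 0
    for value, label in zip(values, labels):
        predicted_positive = value >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def youden_cutoff(values, labels):
    """Return the observed value that maximises J = sensitivity + specificity - 1.
    Assumes both classes are present in `labels`."""
    def youden_j(cutoff):
        m = metrics_at_cutoff(values, labels, cutoff)
        return m["sensitivity"] + m["specificity"] - 1.0

    return max(sorted(set(values)), key=youden_j)
```

Reporting the four metrics at the selected cutoff alongside the AUC is one common way to make a ROC analysis clinically interpretable; other cutoff criteria (for example, a fixed minimum specificity) are equally defensible.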
We thank Reviewer #2 for an insightful and positive overall assessment of our work.
eLife Assessment
This study presents a valuable finding on whether executive resources mediate the impact of language predictability in reading in the context of aging. The evidence is solid in the investigation of prediction in reading, with one caveat that the text materials used could be biased against the aging population. The work will be of interest to cognitive neuroscientists working on reading, language comprehension, and executive control.
Reviewer #1 (Public review):
The authors of this study set out to address a central question in the psycholinguistics literature: does the human brain's ability to predict upcoming language come at a cognitive cost, or is it an automatic, "free" process? To investigate this, they employed a dual-task paradigm where participants read texts word-by-word while simultaneously performing a secondary task (an n-back task on font color) designed to manipulate cognitive load. The study examines how this external cognitive load, along with the effects of aging, modulates the impact of word predictability (measured by surprisal and entropy) on reading times. The central finding is that increased cognitive load diminishes the effects of word predictability, supporting the conclusion that language prediction is a resource-dependent process.
A major strength of the revised manuscript is its comprehensive and parallel analysis of both word surprisal and entropy. The initial submission focused almost exclusively on surprisal, which primarily reflects the cost of integrating a word into its context after it has been perceived. The new analysis now thoroughly investigates entropy as well, which reflects the uncertainty and cognitive effort involved in predicting the next word before it appears. This addition provides a much more complete and theoretically nuanced picture, allowing the authors to address how cognitive load affects both predictive and integrative stages of language processing. This is a significant improvement and substantially increases the paper's contribution to the field.
Furthermore, the authors have commendably addressed the initial concerns regarding the robustness of their replication findings. The first version of the manuscript presented replication results that were inconsistent, particularly for key interaction effects. In the revision, the authors have adopted a more focused and appropriately powered modeling approach for the replication analysis. This revised analysis now demonstrates a consistent effect of cognitive load on the processing of predictable words across both the original and replication datasets. This strengthens the evidence for the paper's primary claim.
The initial review also raised concerns that the results could be explained by general cognitive factors, such as task-switching costs, rather than the specific demands on the language prediction system. While the complexity of cognitive load in a dual-task paradigm remains a challenge, the authors have provided sufficient justification in their revisions and rebuttal to support their interpretation that the observed effects are genuinely tied to the process of language prediction.
Reviewer #2 (Public review):
Summary:
This paper considers the effects of cognitive load (using an n-back task related to font color), predictability, and age on reading times in two experiments. There were main effects of all predictors, but more interesting effects of load and age on predictability. The effect of load is very interesting, but the manipulation of age is problematic, because we don't know what is predictable for different participants (in relation to their age). There are some theoretical concerns about prediction and predictability, and a need to address literature (reading time, visual world, ERP studies).
There is a major concern about the effects of age. See the results (155-190): this depends on what is meant by word predictability. It's correct if it means the predictability in the corpus. But it may or may not be correct if it refers to how predictable a word is to an individual participant. The texts are unlikely to be equally predictable to different participants, and in particular to younger vs. older participants, because of their different experience. To put it informally, the newspaper articles may be more geared to the expectations of younger people. But there is also another problem: the LLM may have learned on the basis of language that has largely been produced by young people and so its predictions are based on what young people are likely to say. Both of these possibilities strike me as extremely likely. So it may be that older adults are affected more by words that they find surprising, but it is also possible that the texts are not what they expect, or the LLM predictions from the text are not the ones that they would make. In sum, I am not convinced that the authors can say anything about the effects of age unless they can determine what is predictable for different ages of participants. I suspect that this failure to control is an endemic problem in the literature on aging and language processing and needs to be systematically addressed.
Overall, I think the paper makes enough of a contribution with respect to load to be useful to the literature. But for discussion of age, we would need something like evidence of how younger and older adults would complete these texts (on a word-by-word basis) and that they were equally predictable for different ages. I assume there are ways to get LLMs to emulate different participant groups, but I doubt if we could be confident about their accuracy without a lot of testing. But without something like this, I think making claims about age would be quite misleading.
The authors respond to my summary comment by saying that prediction is individual and that they account for age-related effects in their models. But these aren't my concerns. Rather:
(1) The texts (these edited newspaper articles) could be more predictable for younger than older adults. If so, effects with older adults could simply be because people are less likely to predict less than more predictable words.
(2) The GPT-2 generated surprisal scores may correspond more closely to younger than older adult responses -- that is, its next word predictions may be more younger- than older-adult-like.
In my view, the authors have two choices: they could remove the discussion of age-related effects, or they could try to address BOTH (1) and (2).
As an aside, consider what we would conclude if we drew similar conclusions from a study in which children and adults read the same (children's) texts, but we didn't test what was predictable to each of them separately.
The paper is really strong in other respects and if my concern is not addressed, the conclusions about age might be generally accepted.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
This manuscript reports a dual-task experiment intended to test whether language prediction relies on executive resources, using surprisal-based measures of predictability and an n-back task to manipulate cognitive load. While the study addresses a question under debate, the current design and modeling framework fall short of supporting the central claims. Key components of cognitive load, such as task switching, word prediction vs integration, are not adequately modeled. Moreover, the weak consistency in replication undermines the robustness of the reported findings. Each point is unpacked below.
Cognitive load is a broad term. In the present study, it can be at least decomposed into the following components:
(1) Working memory (WM) load: news, color, and rank.
(2) Task switching load: domain of attention (color vs semantics), sensorimotor rules (c/m vs space).
(3) Word comprehension load (hypothesized against): prediction, integration.
The components of task switching load should be directly included in the statistical models. Switching of sensorimotor rules may be captured by the "n-back reaction" (binary) predictor. However, the switching of attended domains and the interaction between domain switching and rule complexity (1-back or 2-back) were not included. The attention control experiment (1) avoided useful statistical variation from the Read Only task, and (2) did not address interactions. More fundamentally, task-switching components should be directly modeled in both performance and full RT models to minimize selection bias. This principle also applies to other confounding factors, such as education level. While missing these important predictors, the current models have an abundance of predictors that are not so well motivated (see later comments). In sum, with the current models, one cannot determine whether the reduced performance or prolonged RT was due to affecting word prediction load (if it exists) or merely affecting the task switching load.
The entropy and surprisal need to be more clearly interpreted and modeled in the context of the word comprehension process. The entropy concerns the "prediction" part of the word comprehension (before seeing the next word), whereas surprisal concerns the "integration" part as a posterior. This interpretation is similar to the authors writing in the Introduction that "Graded language predictions necessitate the active generation of hypotheses on upcoming words as well as the integration of prediction errors to inform future predictions [1,5]." However, the Results of this study largely ignored entropy (treating it as a fixed effect) and focused only on surprisal without clear justification.
In Table S3, with original and replicated model fitting results, the only consistent interaction is surprisal x age x cognitive load [2-back vs. Reading Only]. None of the two-way interactions can be replicated. This is puzzling and undermines the robustness of the main claims of this paper.
Reviewer #2 (Public review):
Summary
This paper considers the effects of cognitive load (using an n-back task related to font color), predictability, and age on reading times in two experiments. There were main effects of all predictors, but more interesting effects of load and age on predictability. The effect of load is very interesting, but the manipulation of age is problematic, because we don't know what is predictable for different participants (in relation to their age). There are some theoretical concerns about prediction and predictability, and a need to address literature (reading time, visual world, ERP studies).
Strengths/weaknesses
It is important to be clear that predictability is not the same as prediction. A predictable word is processed faster than an unpredictable word (something that has been known since the 1970/80s), e.g., Rayner, Schwanenfluegel, etc. But this could be due to ease of integration. I think this issue can probably be dealt with by careful writing (see point on line 18 below). To be clear, I do not believe that the effects reported here are due to integration alone (i.e., that nothing happens before the target word), but the evidence for this claim must come from actual demonstrations of prediction.
The effect of load on the effects of predictability is very interesting (and also, I note that the fairly novel way of assessing load is itself valuable). Assuming that the experiments do measure prediction, it suggests that they are not cost-free, as is sometimes assumed. I think the researchers need to look closely at the visual world literature, most particularly the work of Huettig. (There is an isolated reference to Ito et al., but this is one of a large and highly relevant set of papers.)
There is a major concern about the effects of age. See the Results (161-5): this depends on what is meant by word predictability. It's correct if it means the predictability in the corpus. But it may or may not be correct if it refers to how predictable a word is to an individual participant. The texts are unlikely to be equally predictable to different participants, and in particular to younger vs. older participants, because of their different experiences. To put it informally, the newspaper articles may be more geared to the expectations of younger people. But there is also another problem: the LLM may have learned on the basis of language that has largely been produced by young people, and so its predictions are based on what young people are likely to say. Both of these possibilities strike me as extremely likely. So it may be that older adults are affected more by words that they find surprising, but it is also possible that the texts are not what they expect, or the LLM predictions from the text are not the ones that they would make. In sum, I am not convinced that the authors can say anything about the effects of age unless they can determine what is predictable for different ages of participants. I suspect that this failure to control is an endemic problem in the literature on aging and language processing and needs to be systematically addressed.
Overall, I think the paper makes enough of a contribution with respect to load to be useful to the literature. But for discussion of age, we would need something like evidence of how younger and older adults would complete these texts (on a word-by-word basis) and that they were equally predictable for different ages. I assume there are ways to get LLMs to emulate different participant groups, but I doubt that we could be confident about their accuracy without a lot of testing. But without something like this, I think making claims about age would be quite misleading.
We thank both reviewers for their constructive feedback and for highlighting areas where our theoretical framing and analyses could be clarified and strengthened. We have carefully considered each of the points raised and made substantial additions and revisions.
As a summary, we have directly addressed the concerns raised by the reviewers by incorporating task-switching predictors into the statistical models, paralleling our focus on surprisal with a full analysis and interpretation of entropy, clarifying the robustness (and limitations) of the replicated findings, and addressing potential limitations in our Discussion.
We believe these revisions substantially strengthen the manuscript and improve the reading flow, while also clarifying the scope of our conclusions. We will now illustrate these changes in more detail:
(1) Cognitive load and task-switching components.
We agree that cognitive load is a multifaceted construct, particularly since our secondary task broadly targets executive functioning. In response to Reviewer 1, we therefore examined task-switching demands more closely by adding the interaction term n-back reaction × cognitive load to a model restricted to 1-back and 2-back Dual Task blocks (as there were no n-back reactions in the Reading Only condition). This analysis showed significantly longer reading times in the 2-back than in the 1-back condition, both for trials with and without an n-back reaction. Interestingly, the difference between reaction and no-reaction trials was smaller in the 2-back condition (β = -0.132, t(188066.09) = -34.269, p < 0.001), which may simply reflect the general increase in reading time for all trials so that the effect of the button press time decreases in comparison to the 1-back. In that sense, these findings are not unexpected and largely mirror the main effect of cognitive load. Crucially, however, the three-way interaction of cognitive load, age, and surprisal remained robust (β = 0.00004, t(188198.86) = 3.540, p < 0.001), indicating that our effects cannot be explained by differences in task-switching costs across load conditions. To maintain a streamlined presentation, we opted not to include this supplementary analysis in the manuscript.
(2) Entropy analyses.
Reviewer 1 pointed out that our initial manuscript placed more emphasis on surprisal. In the revised manuscript, we now report a full set of entropy analyses in the supplementary material. In brief, these analyses show that participants generally benefit from lower entropy across cognitive load conditions, with one notable exception: young adults in the Reading Only condition, where higher entropy was associated with faster reading times. We have added these results to the manuscript to provide a more complete picture of the prediction versus integration distinction highlighted in the review (see sections “Control Analysis: Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Methods and “Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Results).
(3) Replication consistency.
Reviewer 1 noted that the results of the replication analysis were somewhat puzzling. We take this point seriously and agree that the original model was likely underpowered to detect the effect of interest. To address this, we excluded the higher-level three-way interaction of age, cognitive load, and surprisal, focusing instead on the primary effect examined in this paper: the modulatory influence of cognitive load on surprisal. Using this approach, we observed highly consistent results between the original online subsample and the online replication sample.
(4) Potential age bias in GPT-2.
We thank Reviewer 2 for their thoughtful and constructive feedback and agree that a potential age bias in GPT-2’s next-token predictions warrants caution. We thus added a section in the Discussion explicitly considering this limitation, and explain why it should not affect the implications of our study.
Reviewer #1 (Recommendations for the authors):
The d-prime model operates at the block level. How many observations go into the fitting (about 175*8=1050)? How can the degrees of freedom of a certain variable go up to 188435?
We thank the reviewer for spotting this issue. Indeed, there was an error in our initial calculations, which we have now corrected in the manuscript. Importantly, the correction does not meaningfully affect the results for the analysis of d-primes or the conclusions of the study (see line 102).
“A linear mixed-effects model revealed n-back performance declined with cognitive load (β = -1.636, t(173.13) = -26.120, p < 0.001), with more pronounced effects with advancing age (β = -0.014, t(169.77) = -3.931, p < 0.001; Fig. 3b, Table S1)”.
Consider spelling out all the "simple coding schemes" explicitly.
We thank the reviewer for this helpful suggestion. In the revised manuscript, we have now included the modelled contrasts in brackets after each predictor variable.
Example from line 527: “In both models, we included recording location (online vs. lab), cognitive load (1-back and 2-back Dual Task vs. Reading Only as the reference level) and continuously measured age (centred) as well as the interaction of age and cognitive load as fixed effects”.
The relationship between comprehension accuracy and strategies for color judgement is unclear or not intuitive.
We thank the reviewer for this helpful comment. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour - or sequence of colours - from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions. However, we agree that this distinction may not have been entirely clear, and we have now added a brief clarification in the Methods section to address this point (see line 534):
“Please note that we did not control for trial-level stimulus colour here. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour - or sequence of colours - from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions”.
Could you explain why comprehension accuracy is not modeled in the same way as d-prime, i.e., with a similar set of predictors?
This is a very good point. After each block, participants answered three comprehension questions that were intentionally designed to be easy: they could all be answered correctly after having read the corresponding text, but not by common knowledge alone. The purpose of these questions was primarily to ensure participants paid attention to the texts and to allow exclusion of participants who failed to understand the material even under minimal cognitive load. As comprehension accuracy was modelled at the block level with 3 questions per block, participants could achieve only discrete scores of 0%, 33.3%, 66.7%, or 100%. Most participants showed uniformly high accuracy across blocks, as expected if the comprehension task fulfilled its purpose. However, this limited variance in performance caused convergence issues when fitting a comprehension-accuracy model at the same level of complexity as the d′ model. To model comprehension accuracy nonetheless, we therefore opted for a reduced model complexity in this analysis.
RT of previous word: The motivations described in the Methods, such as post-error-slowing and sequential modulation effects, lack supporting evidence. The actual scope of what this variable may account for is unclear.
We are happy to elaborate further regarding the inclusion of this predictor. Reading times, like many sequential behavioral measures, exhibit strong autocorrelation (Schuckart et al., 2025, doi: 10.1101/2025.08.19.670092). That is, the reading time of a given word is partially predictable from the reading time of the previous word(s). Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the log reading time of the preceding trial as a covariate. This approach removes variance attributable to prior behavior, ensuring that the estimated effects reflect the influence of surprisal and cognitive load on the current word, rather than residual effects of preceding trials. We have now added this explanation to the manuscript (see line 553):
“Additionally, it is important to consider that reading times, like many sequential behavioural measures, exhibit strong autocorrelation (Schuckart et al., 2025), meaning that the reading time of a given word is partially predictable from the reading time of the previous word. Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the reading time of the preceding trial as a covariate”.
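To make the covariate concrete, the sketch below shows one way a lagged log reading time could be derived from trial-level data in Python. It is an illustration under stated assumptions, not the authors' preprocessing code: the field names (`participant`, `block`, `rt_ms`) and the function name `add_previous_log_rt` are hypothetical, and resetting the lag at the start of each block is our assumption rather than something stated in the response.

```python
import math


def add_previous_log_rt(trials):
    """Attach the log reading time of the preceding word to each trial.

    `trials` is a list of dicts in presentation order with the hypothetical
    keys 'participant', 'block' and 'rt_ms'. The first word of each
    participant-block sequence has no predecessor, so its lag is None
    (assumption: the lag does not carry over across blocks).
    """
    enriched = []
    previous_key = None
    previous_log_rt = None
    for trial in trials:
        key = (trial["participant"], trial["block"])
        log_rt = math.log(trial["rt_ms"])
        lagged = previous_log_rt if key == previous_key else None
        enriched.append({**trial, "log_rt": log_rt, "prev_log_rt": lagged})
        previous_key, previous_log_rt = key, log_rt
    return enriched
```

Trials whose lag is `None` would typically be dropped before model fitting, since a mixed-effects model cannot use rows with a missing covariate.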
Block-level d-prime: It was shown with the d-prime performance model that block-level d-prime is a function of many of the reading-related variables. Therefore, it is not justified to use them here as "a proxy of each participant's working memory capacity."
We thank the reviewer for their comment. We would like to clarify that the d-prime performance model indeed included only dual-task d-primes (i.e., d-primes obtained while participants were simultaneously performing the reading task). In contrast, the predictor in question is based on single-task d-primes, which are derived from the n-back task performed in isolation. While dual- and single-task d-primes may be correlated, they capture different sources of variance, justifying the use of single-task d-primes here as a measure of each participant’s working memory capacity.
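For readers less familiar with the measure, d-prime is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The Python sketch below illustrates that standard definition; it is not the authors' code. The log-linear correction (adding 0.5 to each cell) is one common way to avoid infinite z-scores at rates of 0 or 1 and is our assumption, as the response does not say which correction, if any, was applied. The function name `d_prime` is hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (+0.5 per cell, +1 per row total) keeps both
    rates strictly between 0 and 1; this choice is an assumption.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    false_alarm_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)
```

Computed once per block, a score of this kind yields one observation per participant and block, which is the unit of analysis the reviewer refers to in the question about degrees of freedom.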
Word frequency is entangled with entropy and surprisal. Suggest removal.
We appreciate the reviewer’s comment. While word frequency is correlated with word surprisal, its inclusion does not affect the interpretation of the other predictors and does not introduce any bias. Moreover, it is a theoretically important control variable in reading research. Since we are interested in the effects of surprisal and entropy beyond potential biases through word length and frequency, we believe these are important control variables in our model. In addition, checks for collinearity confirmed that word frequency was not strongly correlated with either surprisal or entropy. In this sense, including it is largely pro forma: it neither harms the model nor materially changes the results, but it ensures that the analysis appropriately accounts for a well-established influence on word processing.
Entropy reflects the cognitive load of word prediction. It should be investigated in parallel and with similar depth as surprisal (which reflects the load of integration).
This is an excellent point that warrants further investigation, especially since the previous literature on the effects of entropy on reading time is scarce and somewhat contradictory. We have thus added additional analyses and now report the effects of cognitive load, entropy, and age on reading time (see sections “Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Results, “Control Analysis: Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Methods as well as Fig. S7 and Table S6 in the Supplements for full results). In brief, we observe a significant three-way interaction among age, cognitive load, and entropy. Specifically, while all participants benefit from low entropy under high cognitive load, reflected by shorter reading times, in the baseline condition this benefit is observed only in older adults. Interestingly, in the baseline condition with minimal cognitive load, younger adults even show a benefit from high entropy. Thus, although the overall pattern for entropy partly mirrors that for surprisal – older adults showing increased reading times when word entropy is high and generally greater sensitivity to entropy variations – the effects differ in one important respect. Unlike for surprisal, the detrimental impact of increased word entropy is more pronounced under high cognitive load across all participants.
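The distinction between the two measures can be stated compactly: surprisal is a property of the word that actually occurred, whereas entropy is a property of the whole next-word distribution before the word is seen. The Python sketch below illustrates the standard information-theoretic definitions and is not the authors' pipeline. Using base-2 logarithms (bits) is an assumption, as are the function names; in the study the probabilities come from GPT-2, which is not reproduced here.

```python
import math


def surprisal(p_word):
    """Surprisal of the observed word in bits: -log2 P(word | context)."""
    return -math.log2(p_word)


def entropy(next_word_distribution):
    """Shannon entropy in bits of a next-word distribution given as a
    dict mapping candidate words to probabilities that sum to 1."""
    return -sum(
        p * math.log2(p) for p in next_word_distribution.values() if p > 0
    )
```

For example, a context in which two continuations are equally likely has an entropy of 1 bit regardless of which word follows, while the surprisal of the word that does follow depends only on that word's own probability.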
Reviewer #2 (Recommendations for the authors):
I agree in relation to prediction/load, but I am concerned (actually very concerned) that prediction needs to be assessed with respect to age. I suspect this is one reason why there is so much inconsistency in the effects of age in prediction and, indeed, comprehension more generally. I think the authors should either deal with it appropriately or drop it from the manuscript.
Thank you for raising this important concern. It is true that prediction is a highly individual, complex process as it depends upon the experiences a person has had with language over their lifespan. As such, one-size-fits-all approaches are not sufficient to model predictive processing. In our study, we thus took particular care to ensure that our analyses captured both age-related and other interindividual variability in predictive processing.
First, in our statistical models, we included age not only as a nuisance regressor, but also assessed age-related effects in the interplay of surprisal and cognitive load. By doing so, we explicitly model potential age-related differences in how individuals of different ages predict language under different levels of cognitive load.
Second, we hypothesised that predictive processing might also be influenced by a range of interindividual factors beyond age, including language exposure, cognitive ability, and more transient states such as fatigue. To capture such variability, all models included by-subject random intercepts and slopes, ensuring that unmodelled individual differences were statistically accommodated.
Together, these steps allow us to account for both systematic age-related differences and residual individual variability in predictive processing. We are therefore confident that our findings are not confounded by unmodelled age-related variability.
Line 18, do not confuse prediction (or pre-activation) with predictability. Predictability effects can be due to integration difficulty. See Pickering and Gambi 2018 for discussion. The discussion then focuses on graded parallel predictions, but there is also a literature concerned with the prediction of one word, typically using the "visual world" paradigm (which is barely cited - Reference 60 is an exception). In the next paragraph, I would recommend discussing the N400 literature (particularly Federmeier). There are a number of reading time studies that investigate whether there is a cost to a disconfirmed prediction - often finding no cost (e.g., Frisson, 2017, JML), though there is some controversy and apparent differences between ERP and eye-tracking studies (e.g., Staub). This literature should be addressed. In general, I appreciate the value of a short introduction, but it does seem too focused on neuroscience rather than the very long tradition of behavioural work on prediction and predictability.
We thank the reviewer for this suggestion. In the revised manuscript, we have clarified the relevant section of the introduction to avoid confusion between predictability and predictive processing, thereby improving conceptual clarity (see line 16).
“Instead, linguistic features are thought to be pre-activated broadly rather than following an all-or-nothing principle, as there is evidence for predictive processing even for moderately- or low-restraint contexts (Boston et al., 2008; Roland et al., 2012; Schmitt et al., 2021; Smith & Levy, 2013)”.
We also appreciate the reviewer’s comment regarding the introduction. While our study is behavioural, we frame it in a neuroscience context because our findings have direct implications for understanding neural mechanisms of predictive processing and cognitive load. We believe that this framing is important for situating our results within the broader literature and highlighting their relevance for future neuroscience research.
I don't think a two-word context is enough to get good indicators of predictability. Obviously, almost anything can follow "in the", but the larger context about parrots presumably gives a lot more information. This seems to me to be a serious concern - or am I misinterpreting what was done?
This is a very important point and we thank the reviewer for raising it. Our goal was to generate word surprisal scores that closely approximate human language predictions. In the manuscript, we report analyses using a 2-word context window, following recommendations by Kuribayashi et al. (2022).
To evaluate the impact of context length, we also tested longer windows of up to 60 words (not reported). While previous work (Goldstein et al., 2022) shows that GPT-2 predictions can become more human-like with longer context windows, we found that in our stimuli – short newspaper articles of only 300 words – surprisal scores from longer contexts were highly correlated with the 2-word context, and the overall pattern of results remained unchanged. To illustrate, surprisal scores generated with a 10-word context window and those generated with the 2-word context window used in our analyses correlated at Spearman’s ρ = 0.976.
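For readers who want to run a comparable check on their own surprisal estimates, the following is a minimal sketch of Spearman's rank correlation (Pearson correlation computed on average ranks). The function names and the toy values are hypothetical and are not taken from the study; the authors' actual analysis code is not shown in this response.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy surprisal values for the same words under two context lengths (made up).
short_context = [5.1, 9.8, 3.2, 12.4, 7.7]
long_context = [4.9, 9.1, 3.5, 11.8, 8.0]
print(spearman_rho(short_context, long_context))
```

Because only ranks enter the calculation, the coefficient is insensitive to any monotonic rescaling of the surprisal values.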
Additionally, on a more technical note, using longer context windows reduces the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window (e.g., a 50-word context would exclude ~17% of the data).
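The trade-off described here can be made concrete with a small calculation. The sketch below assumes a single text of 300 words (the approximate article length mentioned above) and that exactly the first k words are dropped for a k-word context window; the function name is hypothetical.

```python
def excluded_fraction(text_length: int, context_size: int) -> float:
    """Fraction of words that lack a full context window of the given size."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    if context_size < 0:
        raise ValueError("context_size must be non-negative")
    # The first `context_size` words have no complete preceding context.
    excluded = min(context_size, text_length)
    return excluded / text_length


for k in (2, 10, 50):
    print(f"{k}-word context: {excluded_fraction(300, k):.1%} of words excluded")
```

Under these assumptions a 50-word window drops one sixth of each 300-word text, consistent with the ~17% figure quoted above, whereas a 2-word window drops under 1%.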
Importantly, while a short 2-word context window may introduce additional noise in the surprisal estimates, this would only bias effects toward zero, making our analyses conservative rather than inflating them. Critically, the observed effects remain robust despite this conservative estimate, supporting the validity of our findings.
However, we agree that this is a particularly important and sensitive point, and have now added a discussion of it to the manuscript (see line 476).
“Entropy and surprisal scores were estimated using a two-word context window. While short contexts have been shown to enhance GPT-2’s psychometric alignment with human predictions, making next-word predictions more human-like (Kuribayashi et al., 2022), other work suggests that longer contexts can also increase model–human similarity (Goldstein et al., 2022). To reconcile these findings in our stimuli and guide the choice of context length, we tested longer windows and found surprisal scores were highly correlated with the 2-word context (e.g., 10-word vs. 2-word context: Spearman’s ρ = 0.976), with the overall pattern of results unchanged. Additionally, employing longer context windows would have also reduced the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window. Crucially, any additional noise introduced by the short context biases effect estimates toward zero, making our analyses conservative rather than inflating them”.
Line 92, task performance, are there interactions? Interactions would fit with the experimental hypotheses.
Yes, we did include an interaction term of age and cognitive load and found significant effects on n-back task performance (d-primes; b = -0.014, t(169.8) = -3.913, p < 0.001), but not on comprehension question accuracy (see Table S1 and Fig. S2 in the supplementary material).
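For readers less familiar with the sensitivity index d-prime, it is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below is illustrative only: the function name, the trial counts, and the log-linear correction (adding 0.5 to each cell to avoid rates of exactly 0 or 1) are assumptions, and the response does not state how the study handled extreme rates.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Assumption: log-linear correction keeps both rates strictly inside (0, 1),
    # where the inverse normal CDF is defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for one participant in one load condition.
print(d_prime(hits=18, misses=2, false_alarms=3, correct_rejections=37))
```

A value of 0 indicates that targets and non-targets are not discriminated; larger values indicate better discrimination.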
Line 149, what were these values?
We found surprisal values ranged between 3.56 and 72.19. We added this information in the manuscript (see line 143).
eLife Assessment
M proteins are essential group A streptococci virulence factors that bind to numerous human proteins; a small subset of M proteins, such as M3, have been reported to bind collagen, which is thought to promote tissue adherence. In this important paper, the authors provide a solid characterization of M3 interactions with collagen. The work raises significant questions regarding the specificity of the structure and its interactions with different collagens, with implications for the variable actions of M protein collagen interactions on biofilm formation.
Reviewer #1 (Public review):
Summary:
Wojnowska et al. report structural and functional studies of the interaction of Streptococcus pyogenes M3 protein with collagen. They show through X-ray crystallographic studies that the N-terminal hypervariable region of M3 protein forms a T-like structure, and that the T-like structure binds a three-stranded collagen-mimetic peptide. They indicate that the T-like structure is predicted by AlphaFold3 with moderate confidence level in other M proteins that have sequence similarity to M3 protein and M-like proteins from group C and G streptococci. For some, but not all, of these related M and M-like proteins, AlphaFold3 predicts, with moderate confidence level, complexes similar to the one observed for M3-collagen. Functionally, the authors show that emm3 strains form biofilms with more mass when surfaces are coated with collagen, and this effect can be blocked by an M3 protein fragment that contains the T-structure. They also show the co-occurrence of emm3 strains and collagen in patient biopsies and a skin tissue organoid. Puzzlingly, M1 protein has been reported to bind collagen, yet collagen inhibits biofilm formation in a particular emm1 strain, while that same emm1 strain colocalizes with collagen in a patient biopsy sample. The implications of the variable actions of collagen on biofilm formation are not clear.
Strengths:
The paper is well written and the results are presented in a logical fashion.
Weaknesses:
A major limitation of the paper is that it is almost entirely observational and lacks detailed molecular investigation. Insufficient details or controls are provided to establish the robustness of the data.
Comments on revisions:
The authors' response to this reviewer's Major issue #1 is inadequate. Their argument is essentially that if they denature the protein, then there is no activity. This does not address the specificity of the structure or its interactions.
They went only part way to addressing this reviewer's Major issue #2. While Figure 8 - supplement 3 shows 1D NMR spectra for M3 protein (what temperature?), it does not establish that stability is unaltered (to a significant degree).
This reviewer's Major issue #3 is one of the major reasons for considering this study to be observational. This reviewer agrees that structural biology is by its nature observational, but modern standards require validation of structural observations. The authors' response is that a mechanistic investigation involving mutant bacterial strains and validation involving mutated proteins is beyond their scope. Therefore, the study remains observational.
Major issue 4 was addressed suitably, but brings up the problematic point that the emm1 2006 strain colocalizes quite well with collagen in a patient biopsy sample but not in other assays. This calls into question the overall interpretability of the patient biopsy data.
The authors have not provided a point-by-point response. Issues that were indicated to be minor previously were deemed to be minor because this reviewer thought that they could easily be addressed in a revision. It appears that the authors have ignored many of these comments, and these issues are therefore now considered to be major issues. For example, no errors are given for Kd measurements, Table 2 is sloppy and lacks the requested information, negative controls are missing (Figure 10 - figure supplement 1), and there is no indication of how many independent times each experiment was done.
And "C4-binding protein" should be corrected to "C4b-binding protein."
Reviewer #2 (Public review):
Streptococcus pyogenes, or group A streptococci (GAS), can cause diseases ranging from skin and mucosal infections to plasma invasion and post-infection autoimmune syndromes. M proteins are essential GAS virulence factors that include an N-terminal hypervariable region (HVR). M proteins are known to bind to numerous human proteins; a small subset of M proteins were reported to bind collagen, which is thought to promote tissue adherence. In this paper, the authors characterize M3 interactions with collagen and its role in biofilm formation. Specifically, they screened different collagen type II and III variants for full-length M3 protein binding using an ELISA-like method, detecting anti-GST antibody signal. By statistical analysis, hydrophobic amino acids and hydroxyproline were found to positively support binding, whereas acidic residues and proline negatively impacted binding. The authors applied X-ray crystallography to determine the structure of the N-terminal domain (amino acids 42-151) of M3 protein (M3-NTD). The M3-NTD dimer (PDB 8P6K) forms a T-shaped structure with three helices (H1, H2, H3), which are stabilized by a hydrophobic core, inter-chain salt bridges and hydrogen bonds on the H1 and H2 helices, and the H3 coiled coil. The conserved Gly113 serves as the turning point between H2 and H3. The M3-NTD was co-crystallized with a 24-residue peptide, JDM238, to determine the structure of M3-collagen binding. The structure (PDB 8P6J) shows that two copies of collagen in parallel bind to H1 and H2 of M3-NTD. Among the residues involved in binding, conserved Try96 is shown to play a critical role, supported by structure and isothermal titration calorimetry (ITC). The authors also apply a crystal-violet assay and fluorescence microscopy to determine that M3 is involved in collagen type I binding, but not M1 or M28. Tissue biopsy staining indicates that M3 strains co-localize with collagen IV-containing tissue, while M1 strains do not.
The authors provide generally compelling evidence to show that GAS M3 protein binds to collagen and plays a critical role in forming biofilms, which contribute to disease pathology. This is a very well-executed study and a well-written report relevant to understanding GAS pathogenesis and approaches to combatting disease; the data are also applicable to the emerging human pathogen Streptococcus dysgalactiae. One caveat that was not entirely resolved is if/how different collagen types might impact M3 binding and function. Due to technical constraints, the in vitro structure and other binding assays use type II collagen, whereas the in vivo biofilm formation assays and tissue biopsy staining use type I and IV collagen; it was unclear if this difference is significant. One possibility is that M3 binds all types of collagen without bias, and that only the distribution of collagens leads to the finding that M3 binds to type IV (basement membrane) and type I (various tissues including skin), rather than type II (cartilage).
Comments on revisions:
We are glad to see that the authors addressed our prior comments on M3 binding to different types of collagens in the discussion section; adding a prediction of M3 binding to type I collagen (Figure 8-figure supplement 1B and 1C) is helpful to fill in the gap. Although it would be nice to fill in the gap experimentally by putting all types of collagens into one experiment (for example, as in Figure 9A, using different types of human collagens to test biofilm formation; or as in Figure 10, using different types of human collagens to compete for biofilm formation), this appears to be beyond the scope of this paper. Meanwhile, the changes they have made are constructive.
The authors have addressed the majority of our prior comments.
Author response:
The following is the authors’ response to the current reviews.
We thank the reviewers for their comments on the initial submission, which helped us improve and extend the paper. We would like to respond specifically to reviewer #1.
We disagree with the broad criticism of this study as being “almost entirely observational” and lacking “detailed molecular investigation”. We report structures and binding data, show mechanistic detail, identify critical residues and structural features underlying biological activity, and present biologically meaningful data demonstrating a role of the interaction of the M3 protein with collagens. We disagree that insufficient details or controls are included. We agree that our report has limitations, such as an understanding of potential emm1 strain binding to collagen, which might play a role in host tissue colonization, but not in biofilm.
In response to issues raised in the initial review, we conducted several new experiments for the revised manuscript. We believe these strengthen what we report. Firstly, as the reviewer suggested, we conducted a binding experiment where the tertiary fold of M3-NTD was disrupted to confirm the T-shaped fold is indeed required for binding to collagen, as might be expected based on the crystal structure of the complex. To achieve this, we did not, as the reviewer states, use denatured protein in the ITC binding experiment. Instead, we used a monomeric form of M3-NTD, which does not adopt a well-defined tertiary structure, but retains all residues in the context of alpha helices. Secondly, we added more evidence for the importance of structural features (amino acid side chains defining the collagen binding site) by analysing the role of Trp103. Together, we provide clear evidence for the specific role of the T-shaped fold of M3-NTD for collagen binding.
Responding to a constructive criticism by reviewer #1 we characterised M3-NTD mutants to demonstrate conservation of overall structure. NMR is an exquisite tool for this as it is highly sensitive to structural changes. It is not clear why the reviewer suggested we should have measured the stability of the proteins, which is irrelevant here. What matters is that the fold is conserved between mutated variants at the chosen experimental temperature (now added to the Methods section), which NMR demonstrates.
We added errors for the ITC-derived dissociation constants.
In the submitted versions of the paper we did not include the negative control requested by reviewer #1 for experiments shown in Figure 10 - figure supplement 1B. In our view this does not add information supporting our findings. However, we have now added two negative controls, staining of emm1 and emm28 strains. As expected, no reactivity was found with the type-specific M3 HVR antiserum while the M3 BCW antiserum showed weak reactivity, in line with some sequence similarity of the C-terminal regions of M proteins.
Table 2 contains essential information, in line with what generally is shown in crystallographic tables in this journal. All other information can be found in the depositions of our data at the PDB. The structures have been scrutinised and checked by the PDB and passed all quality tests.
We stated how many times experiments were done where appropriate. We now added this information for CLC assays (as given in the previously published protocol, refs. 45, 47). ITC was carried out more than once for optimization but the results of single experiments are shown (as is common practice).
The following is the authors’ response to the original reviews.
Many thanks for assessing our submission. We are grateful for the reviews that have informed a revised version of the paper, which includes additional data and modified text to take into account the reviewers’ comments.
We addressed the major limitation identified by Reviewer #1 by including data to demonstrate that collagen binding is indeed dependent on the T-shaped fold (major issue 1). Reviewer #1 suggested this needs to be done through extensive mutational work. This in our view was neither feasible nor necessary. Instead, we used ITC to measure collagen peptide binding using a monomeric form of M3, which preserves all residues including the ones involved in binding, but cannot form the T-shaped structure. This achieves the same as unravelling the T fold through mutations, but without the risk of affecting binding through altering residues that are involved in both binding and definition of the T fold. The experiment shows a very weak interaction, confirming the fold of the M3-NTD is required for binding activity.
Reviewer #1 finds the study limited for being “almost entirely observational”. Structural biology is by its nature observational, which is not a limitation but the very purpose of this approach. Our study goes beyond observing structures. In the first version of our paper, we identified a critical residue within a previously mapped binding site, and demonstrated through mutagenesis a causal link between presence of this residue on a tertiary fold and collagen binding activity. However, we agree this analysis could have been strengthened by additional mutagenesis, which we carried out and describe in the revised manuscript. This identifies a second residue that is critical for collagen binding. We firmed up these mutational experiments with a characterisation of mutated forms of M3 by NMR spectroscopy to confirm that these mutations did not affect the overall fold, addressing major issue no. 2 of reviewer #1. We further demonstrate that the interaction between M3 and collagen is the cause of greatly enhanced biofilm formation as observed in patient biopsies and a tissue model of infection. We show that other streptococci that do not possess a surface protein presenting collagen binding sites like M3 do not form collagen-dependent biofilm. We therefore do not think that criticising our study for being almost entirely observational is valid.
Major issue 3:
We agree with the reviewer that it would be useful to carry out experiments with k.o. and complemented strains. Such experiments go beyond the scope of our study, but might be carried out by us or others in the future. We disagree that emm1 is used “as a negative”. Instead, we established that, in contrast to emm3 strains, emm1 strain biofilm formation is not enhanced by collagen.
We addressed major issue 4 by quantifying colocalizations in the patient biopsies and 3D tissue model experiments.
We thank Reviewer #2 for the thorough analysis of our reported findings. The main criticism here (issue 1) concerns the question of whether binding of emm3 streptococci would differ between different types of collagen. Our collagen peptide binding assays together with the structural data identify the collagen triple helix as the binding site for M3. While collagen types differ in their distribution, functions and morphology in different tissues, they all have in common triple-helical (COL) regions with high sequence similarity that are non-specifically recognised by M3. Therefore, our data in conjunction with the body of published work showing binding of M3 to collagens I, II, III and IV suggest it is highly likely that emm3 streptococci will indeed bind to all types of collagen in the same manner. We added a statement to the manuscript to make this point more clearly. We also added a prediction of a complex between M3 and a collagen I triple-helical peptide, which supports the idea of a conserved binding mechanism for all collagen types. Whether this means all collagen types in the various tissues where they occur are targeted by emm3 streptococci is a very interesting question, however one that goes beyond the scope of our study.
Minor issues identified by the reviewers were addressed through changes in the text and addition of figures.
Summary of changes:
(1) Two new authors have been added due to inclusion of additional data and analysis.
(2) New experimental data included in section "M3-NTD harbors the collagen binding site".
(3) Figure 3 panels A and B assigned and swapped.
(4) Figure 4 changed to include new data and move mutant M3-NTD ITC graphs to supplement.
(5) Table 2 corrected and amended.
(6) AlphaFold3 quality parameters ipTM and pTM added to all figures showing predicted structures.
(7) New supplementary figure added showing crystal packing of M3-NTD/collagen peptide complex.
(8) Figure supplement of predicted M-protein/collagen peptide complexes includes new panel for a type I collagen peptide bound to M3.
(9) New figure supplement showing mutant M3-NTD ITC data.
(10) New figure supplement showing 1D ¹H NMR spectra of M3-NTD mutants.
(11) Included data for additional M3-NTD mutants assessing role of Trp103 in collagen binding. Text extended to describe and place into context findings from ITC binding studies using these mutants.
(12) Added quantitative analysis of biopsy and tissue model data (Manders' overlap coefficient).
(13) Corrected and extended table 3 to take into account new primers.
(14) Added experimental details for new NMR and ITC experiments as well as new quantitative image analysis.
(15) Minor adjustments to the text to improve clarity and correct errors.
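As background to item (12), Manders' overlap coefficient for two image channels is the sum of the pixel-wise intensity products divided by the square root of the product of the summed squared intensities of each channel. The sketch below applies that standard definition to flattened intensity lists; the function name and pixel values are hypothetical, and the image analysis software used in the study is not named in this response.

```python
import math


def manders_overlap(channel_a, channel_b):
    """Manders' overlap coefficient for two equally sized intensity lists."""
    if len(channel_a) != len(channel_b):
        raise ValueError("channels must contain the same number of pixels")
    numerator = sum(a * b for a, b in zip(channel_a, channel_b))
    denominator = math.sqrt(
        sum(a * a for a in channel_a) * sum(b * b for b in channel_b)
    )
    if denominator == 0:
        raise ValueError("coefficient is undefined when a channel is all zeros")
    return numerator / denominator


# Hypothetical pixel intensities: bacterial stain vs. collagen stain.
bacteria = [0, 12, 40, 35, 3, 0]
collagen = [2, 10, 38, 30, 0, 1]
print(manders_overlap(bacteria, collagen))
```

For non-negative intensities the coefficient lies between 0 (no pixels with signal in both channels) and 1 (intensities proportional across all pixels).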
eLife Assessment
This important study introduces the Life Identification Number (LIN) coding system as a powerful and versatile approach for classifying Neisseria gonorrhoeae lineages. The authors show that LIN codes capture both previously defined lineages and their relationships in a way that aligns with the species' phylogenetic structure. The compelling evidence presented, together with its integration into the PubMLST platform, underscores its strong potential to enhance epidemiological surveillance and advance our understanding of gonococcal population biology.
Reviewer #3 (Public review):
Summary:
In this well-written manuscript, Unitt and colleagues propose a new, hierarchical nomenclature system for the pathogen Neisseria gonorrhoeae. The proposed nomenclature addresses a longstanding problem in N. gonorrhoeae genomics, namely that the highly recombinant population complicates typing schemes based on only a few loci and that previous typing systems, even those based on the core genome, group strains at only one level of genomic divergence without a system for clustering sequence types together. In this work, the authors have revised the core genome MLST scheme for N. gonorrhoeae and devised life identification numbers (LIN) codes to describe the N. gonorrhoeae population structure.
Strengths:
The LIN codes proposed in this manuscript are congruent with previous typing methods for Neisseria gonorrhoeae like cgMLST groups, Ng-STAR, and NG-MAST. Importantly, they improve upon many of these methods as the LIN codes are also congruent with the phylogeny and represent monophyletic lineages/sublineages. Additionally, LIN code cluster assignment is fixed, and clusters are not fused as is common in other typing schemes.
The LIN code assignment has been implemented in PubMLST allowing other researchers to assign LIN codes to new assemblies and put genomes of interest in context with global datasets, including in private datasets.
Weaknesses:
The authors have defined higher resolution thresholds for the LIN code scheme. However, they do not investigate how these levels correspond to previously identified transmission clusters from genomic epidemiology studies. This will be an important focus of future work, but it may be beyond the scope of the current manuscript.
Comments on revisions:
The authors have addressed my previous comments. I have no additional recommendations.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
Summary:
Bacterial species that frequently undergo horizontal gene transfer events tend to have genomes that approach linkage equilibrium, making it challenging to analyze population structure and establish the relationships between isolates. To overcome this problem, researchers have established several effective schemes for analyzing N. gonorrhoeae isolates, including MLST and NG-STAR. This report shows that Life Identification Number (LIN) Codes provide for a robust and improved discrimination between different N. gonorrhoeae isolates.
Strengths:
The description of the system is clear, the analysis is convincing, and the comparisons to other methods show the improvements offered by LIN Codes.
Weaknesses:
No major weaknesses were identified by this reviewer.
We thank the reviewer for their assessment of our paper.
Reviewer #2 (Public review):
Summary:
This paper describes a new approach for analyzing genome sequences.
Strengths:
The work was performed with great rigor and provides much greater insights than earlier classification systems.
Weaknesses:
A minor weakness is that the clinical application of LIN coding could be articulated in a more in-depth way. The LIN coding system is very impressive and is certainly superior to other protocols. My recommendation, although not necessary for this paper, is that the authors expand their analysis to noncoding sequences, especially those upstream of open reading frames. In this respect, important cis-acting regulatory mutations that might help to further distinguish strains could be identified.
We thank the reviewer for their comments. LIN code could be applied clinically, for example in the analysis of antibiotic resistant isolates, or to investigate outbreaks associated with a particular lineage. We have updated the text to note this, starting at line 432.
With regard to non-coding sequences: unfortunately, intergenic regions are generally unsuitable for use in typing systems as (i) they are subject to phase variation, which can occlude relationships based on descent; and (ii) they are inherently difficult to assemble and therefore can introduce variation due to the sequencing procedure rather than biology. For the type of variant typing that LIN code represents, which aims to replicate phylogenetic clustering, protein-encoding sequences are the best choice for convenience, stability, and accuracy. This is not to say that basing a nomenclature on intergenic regions is not a valid objective; such a nomenclature might be especially suitable for predicting some phenotypic characters, but it would still be subject to problem (ii), depending on the sequencing technology used. Such a nomenclature system should stand beside, rather than be combined with or used in place of, phylogenetic typing. However, we could certainly investigate the relationship between an isolate's LIN code and regulatory mutations in the future.
Reviewer #3 (Public review):
Summary:
In this well-written manuscript, Unitt and colleagues propose a new, hierarchical nomenclature system for the pathogen Neisseria gonorrhoeae. The proposed nomenclature addresses a longstanding problem in N. gonorrhoeae genomics, namely that the highly recombinant population complicates typing schemes based on only a few loci and that previous typing systems, even those based on the core genome, group strains at only one level of genomic divergence without a system for clustering sequence types together. In this work, the authors have revised the core genome MLST scheme for N. gonorrhoeae and devised life identification numbers (LIN) codes to describe the N. gonorrhoeae population structure.
Strengths:
The LIN codes proposed in this manuscript are congruent with previous typing methods for Neisseria gonorrhoeae, like cgMLST groups, Ng-STAR, and NG-MAST. Importantly, they improve upon many of these methods as the LIN codes are also congruent with the phylogeny and represent monophyletic lineages/sublineages.
The LIN code assignment has been implemented in PubMLST, allowing other researchers to assign LIN codes to new assemblies and put genomes of interest in context with global datasets.
Weaknesses:
The authors correctly highlight that cgMLST-based clusters can be fused due to "intermediate isolates" generated through processes like horizontal gene transfer. However, the LIN codes proposed here are also based on single linkage clustering of cgMLST at multiple levels. It is unclear whether future recombination or sequencing of previously unsampled diversity within N. gonorrhoeae could merge higher-level clusters, and if so, how this would impact the stability of the nomenclature.
The authors have defined higher resolution thresholds for the LIN code scheme. However, they do not investigate how these levels correspond to previously identified transmission clusters from genomic epidemiology studies. It would be useful for future users of the scheme to know the relevant LIN code thresholds for these investigations.
We thank the reviewer for their insightful comments. LIN codes do use multi-level single linkage clustering to define the cluster number of isolates. However, unlike previous applications of simple single linkage clustering such as N. gonorrhoeae core genome groups (Harrison et al., 2020), once assigned in LIN code, these cluster numbers are fixed within an unchanging barcode assigned to each isolate. Therefore, the nomenclature is stable, as the addition of new isolates cannot change previously established LIN codes.
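The stability argument can be illustrated with a simplified sketch. It assumes that a new isolate is compared with its nearest already-coded neighbour by allelic mismatches, inherits that neighbour's code prefix down to the deepest threshold the distance satisfies, and receives a fresh number at the next position. The function names, the toy six-locus profiles, and the three thresholds are hypothetical; this is not PubMLST's implementation and not the thresholds of the gonococcal scheme.

```python
def mismatches(profile_a, profile_b):
    """Count loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def assign_lin_code(new_profile, existing, thresholds):
    """Return a code for new_profile without altering any existing code.

    existing: dict mapping isolate name -> (profile, code tuple)
    thresholds: allelic-mismatch thresholds, from most to least permissive
    """
    if not existing:
        return (0,) * len(thresholds)
    # Nearest neighbour among isolates that already carry a code.
    nearest_profile, nearest_code = min(
        existing.values(), key=lambda pc: mismatches(new_profile, pc[0])
    )
    distance = mismatches(new_profile, nearest_profile)
    # Count the leading positions whose threshold the distance satisfies.
    shared = 0
    for threshold in thresholds:
        if distance > threshold:
            break
        shared += 1
    prefix = nearest_code[:shared]
    if shared == len(thresholds):
        return prefix
    # Open a new cluster at the first position that is not shared.
    used = [code[shared] for _, code in existing.values() if code[:shared] == prefix]
    return prefix + (max(used) + 1,) + (0,) * (len(thresholds) - shared - 1)


thresholds = (4, 2, 0)  # illustrative values only
database = {}
isolates = [
    ("A", (1, 1, 1, 1, 1, 1)),
    ("B", (1, 1, 1, 1, 1, 2)),
    ("C", (2, 2, 2, 1, 1, 1)),
    ("D", (9, 9, 9, 9, 9, 9)),
]
for name, profile in isolates:
    database[name] = (profile, assign_lin_code(profile, database, thresholds))
    print(name, database[name][1])
```

In this sketch each code is written once when the isolate is added and never revisited, so later isolates can open new clusters but cannot merge or renumber earlier ones, which is the property the response relies on.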
Cluster stability was considered during the selection of allelic mismatch thresholds. By choosing thresholds based on natural breaks in population structure (Figure 3), applying clustering statistics such as the silhouette score, and by assessing where cluster stability has been maintained within the previous core genome groups nomenclature, we can have confidence that the thresholds which we have selected will form stable clusters. For example, with core genome groups there has been significant group fusion with clusters formed at a threshold of 400 allelic differences, while clustering at a threshold of 300 allelic differences has remained cohesive over time (supported by a high silhouette score) and so was selected as an important threshold in the gonococcal LIN code. LIN codes have now been applied to >27000 isolates in PubMLST, and the nomenclature has remained effective despite the continual addition of new isolates to this collection. The manuscript emphasises these points at line 96 and 346.
Work is in progress to explore what LIN code thresholds are generally associated with transmission chains. These will likely be the last 7 thresholds (25, 10, 7, 5, 3, 1, and 0 allelic differences), as previous work has suggested that isolates linked by transmission within one year are associated with <14 single nucleotide polymorphism differences (De Silva et al., 2016). The results of this analysis will be described in a future article, currently in preparation.
Harrison, O.B., et al. Neisseria gonorrhoeae Population Genomics: Use of the Gonococcal Core Genome to Improve Surveillance of Antimicrobial Resistance. The Journal of Infectious Diseases 2020.
De Silva, D., et al. Whole-genome sequencing to determine transmission of Neisseria gonorrhoeae: an observational study. The Lancet Infectious Diseases 2016;16(11):1295-1303.
Reviewer #3 (Recommendations for the authors):
(1) Data/code availability: While the genomic data and LIN codes are available in PubMLST and new isolates uploaded to PubMLST can be assigned a LIN code, it is also important to have software version numbers reported in the methods section and code/commands associated with the analysis in this manuscript (e.g. generation of core genome, statistical analysis, comparison with other typing methods) documented in a repository like GitHub.
Software version numbers have been added to the manuscript. Scripts used to run the software have been compiled and documented on protocols.io, DOI: dx.doi.org/10.17504/protocols.io.4r3l21beqg1y/v1
(2) Line 37: Missing "a" before "multi-drug resistant pathogen".
This has been corrected in the text.
(3) Line 60: Typo in geoBURST.
The text refers to a tool called goeBURST (global optimal eBURST) as described in Francisco, A.P. et al., 2009. DOI: 10.1186/1471-2105-10-152. Therefore, “geoBURST” would be incorrect.
(4) Line 136-138: It might be helpful to discuss how premature stop codons are treated in this scheme. Often in isolates with alleles containing early premature stop codons, annotation software like prokka will annotate two separate ORFs, which are then clustered with pangenome software like PIRATE. How does the cgMLST scheme proposed here treat premature stop codons? Are sequences truncated at the first stop codon, or is the nucleotide sequence for the entire gene used even if it is out of frame?
In PubMLST, alleles with premature stop codons are flagged, but otherwise annotated from the typical start to the usual stop codon, if still present. This also applies to frameshift mutations – a new unique allele will be annotated, but flagged as frameshift. In both cases, each new allele with a premature stop codon or frameshift will require human curator involvement to be assigned, to ensure rigorous allele assignment. As the Ng cgMLST v2 scheme prioritised readily auto-annotated genes, loci which are prone to internal stop codons or frameshifts with inconsistent start/end codons are excluded from the scheme. The text has been updated at line 128 to mention this.
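As a rough illustration of the flagging described above, the sketch below marks an allele sequence that contains an in-frame stop codon before its final codon, or whose length is not a multiple of three. This is an assumption-laden heuristic for illustration only and is not PubMLST's curation logic; the function name and flag strings are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons


def flag_allele(seq):
    """Return a list of flags for a nucleotide allele read from its first base."""
    seq = seq.upper()
    flags = []
    remainder = len(seq) % 3
    if remainder != 0:
        # Heuristic: a length that is not a multiple of three suggests a frameshift.
        flags.append("frameshift")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - remainder, 3)]
    # Any stop codon before the last complete codon counts as internal.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        flags.append("internal stop codon")
    return flags
```

A sequence flagged in this way would still be recorded as a new allele but routed to a human curator, as described in the response above.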
(5) Line 213-214: What were the versions of software and parameters used for phylogenetic tree construction?
Version numbers have been added to the text between lines 214-219. Parameters have been included with the scripts documented at protocols.io DOI: dx.doi.org/10.17504/protocols.io.4r3l21beqg1y/v1
(6) Line 249: K. pneumoniae may also be a more diverse/older species than N. gonorrhoeae.
The text has been updated at line 252-253 to emphasize the difference in diversity. The age of N. gonorrhoeae as a species is a matter of scientific debate, and out of the scope of this paper to discuss.
(7) Line 278-279: Were some isolates unable to be typed, or have they just been added since the LIN code assignment occurred?
Some genomes cannot be assigned a LIN code due to poor genome quality. A minimum of 1405/1430 core genes must have an allele designated for a LIN code to be assigned. Genomes with large numbers of contigs may not meet this requirement. LIN code assignment is an ongoing process that occurs on a weekly basis in PubMLST, performed in batches starting at 23:00 (UK local time) on Sundays. The text has been updated to describe this at lines 196 and 282-283.
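The eligibility rule stated above can be sketched as a simple check. The numbers 1405 and 1430 come from the response; the function name, the profile representation (a mapping from locus name to allele identifier, with None for undesignated loci), and the locus names are hypothetical.

```python
SCHEME_LOCI = 1430          # core genes in the scheme, as stated above
MIN_DESIGNATED_LOCI = 1405  # minimum loci with a designated allele


def can_assign_lin_code(profile):
    """Return True if enough core loci have an allele designated."""
    designated = sum(1 for allele in profile.values() if allele is not None)
    return designated >= MIN_DESIGNATED_LOCI


def make_profile(missing):
    """Build a hypothetical profile with the given number of undesignated loci."""
    return {
        f"locus_{i:04d}": (None if i < missing else 1)
        for i in range(SCHEME_LOCI)
    }
```

Under this rule a genome may lack designations at up to 25 core loci and still receive a LIN code; a highly fragmented assembly missing more than that would be skipped until a better assembly is available.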
(8) Line 314-315: Was BAPS rerun on the dataset used in this manuscript, or is this based on previously assigned BAPS groups?
This was based on previously assigned BAPS groups, as described between lines 315-320.
(9) Line 421-423: Are there options for assigning LIN codes that do not require uploading genomes to PubMLST? I can imagine that there may be situations where researchers or public health institutions cannot share genomic data prior to publication.
Isolate data does not need to be shared publicly to be uploaded and assigned a LIN code in PubMLST. Data owners can create a private dataset within PubMLST, viewable only to them, on which automated assignment will be performed. LIN code assignment requires a central repository of genomes against which new codes can be assigned. The text has been updated to emphasize this at lines 197 and 427.
(10) Figure 6: How is this tree rooted? Additionally, do isolates that have unannotated LIN codes represent uncommon LIN codes or were those isolates not typed?
The tree has been left unrooted, as it is being used to visualise the relationships between the isolates rather than to explore ancestry. Detail on what LIN codes have been annotated can be found in the figure legend, which describes that the 21 most common LIN code lineages in this 1000 isolate dataset have been labelled. All 1000 isolates used in the tree had a LIN code assigned, but to ensure good legibility not all lineages were annotated on the tree. The legend has been updated to improve clarity.
eLife Assessment
This valuable study uses zebrafish as a model to reveal a role for the cell cycle protein kinase CDK2 as a negative regulator of type I interferon signaling. The evidence supporting the authors' claims is convincing, including both in vivo and in vitro investigative approaches that corroborate a role for CDK2 in regulating TBK1 degradation. In this latest version, the authors included data addressing a concern raised by the reviewer in the previous peer review round. This work will interest cell biologists, immunologists, and virologists.
Reviewer #1 (Public review):
Summary:
The authors set out to evaluate the regulation of interferon (IFN) gene expression in fish, using mainly zebrafish as a model system. Similar to more widely characterized mammalian systems, fish IFN is induced during viral infection through the action of the transcription factor IRF3, which is activated by phosphorylation by the kinase TBK1. It has been previously shown in many systems that TBK1 is subjected to both positive and negative regulation to control IFN production. In this work, the authors find that the cell cycle kinase CDK2 functions as a TBK1 inhibitor by decreasing its abundance through recruitment of the ubiquitin ligase, Dtx4, which has been similarly implicated in the regulation of mammalian TBK1. Experimental data are presented showing that CDK2 interacts with both TBK1 and Dtx4, leading to TBK1 K48 ubiquitinylation on K567 and its subsequent degradation by the proteasome.
Strengths:
The strengths of this manuscript are its novel demonstration of the involvement of CDK2 in a process in fish that is controlled by different factors in other vertebrates and its clear and supportive experimental data.
Weaknesses:
The weaknesses of the study include the following. 1) It remains unclear how CDK is regulated during viral infection and how it specifically recruits E3 ligase to TBK1. The authors find that its abundance increases during viral infection, an unusual finding given that CDK2 levels are often found to be stable. How this change in abundance might affect cell cycle control was not explored. 2) The implications and mechanisms for a relationship between the cell cycle and IFN production will be a fascinating topic for future studies. In particular, it will be critical to determine if CDK2 catalytic activity is required. An experiment with an inhibitor suggests that this novel action of CDK2 is kinase independent, but the lack of controls showing the efficacy of the inhibitor prevents a firm conclusion. It will also be critical to determine if there is a role for cyclins in this process or if there is competition for binding between TBK1 and cyclin and, if so, if this has an impact on the cell cycle. Likewise, an impact of CDK2 induction by virus infection on normal cell cycling will be important to investigate.
Reviewer #2 (Public review):
Summary:
In this paper, the authors describe a novel function involving the cell cycle protein kinase CDK2, which binds to TBK1 (an essential component of the innate immune response) leading to its degradation in a ubiquitin/proteasome-dependent manner. Moreover, the E3 ubiquitin ligase, Dtx4, is implicated in the process by which CDK2 increases the K48-linked ubiquitination of TBK1. This paper presents intriguing findings on the function of CDK2 in lower vertebrates, particularly its regulation of IFN expression and antiviral immunity.
Strengths:
(1) The research employs a variety of experimental approaches to address a single question. The data are largely convincing and appear to be well executed.
(2) The evidence is strong and includes a combination of in vivo and in vitro experiments, including knockout models, protein interaction studies, and ubiquitination analyses.
(3) This study significantly impacts the field of immunology and virology, particularly concerning the antiviral mechanisms in lower vertebrates. The findings provide new insights into the regulation of IFN expression and the broader role of CDK2 in immune responses. The methods and data presented in this paper are highly valuable for the scientific community, offering new avenues for research into antiviral strategies and the development of therapeutic interventions targeting CDK2 and its associated pathways.
Author response:
The following is the authors’ response to the previous reviews.
Reviewer #1 (Public Review):
The weaknesses of the study include the following.
(1) It remains unclear how CDK is regulated during viral infection and how it specifically recruits E3 ligase to TBK1.
We would like to express our gratitude to the reviewer for highlighting this significant issue. The present study demonstrates that CDK2 expression is significantly upregulated upon SVCV infection in multiple fish tissues and cell lines (see Fig. 1C-F), thus suggesting that viral infection triggers CDK2 induction. However, the precise upstream signaling pathways that regulate CDK2 during viral infection remain to be fully elucidated. It is hypothesized that viral RNA sensors may activate transcription factors that bind to the cdk2 promoter; however, further investigation is required to confirm this. We have added a sentence in the Discussion (Lines 409-412) acknowledging this as a limitation and a focus for future work, suggesting potential involvement of viral sensor pathways.
With regard to the mechanism by which CDK2 recruits the E3 ligase Dtx4 to TBK1, evidence is provided that CDK2 directly interacts with both TBK1 (via its kinase domain) and Dtx4 (see Fig. 4F-I, 6A-C). Furthermore, evidence is presented demonstrating that CDK2 enhances the interaction between Dtx4 and TBK1 (Fig. 6D), thus suggesting that CDK2 functions as a scaffold protein to facilitate the formation of a ternary complex. However, further study is required to ascertain the precise structural basis of this interaction, including whether CDK2's kinase activity is required. We have added a note in the Discussion (Lines 417-421) acknowledging this limitation and proposing future structural studies to elucidate the precise binding interfaces.
(2) The implications and mechanisms for a relationship between the cell cycle and IFN production will be a fascinating topic for future studies.
We concur with the reviewer's assertion that the interplay between cell cycle progression and innate immunity constitutes a promising and under-explored research domain. Whilst the present study concentrates on the function of CDK2 in antiviral signaling, independent of its cell cycle functions, it is acknowledged that CDK2's activity is cell cycle-dependent. It is hypothesized that CDK2 may function as a molecular link between cell proliferation and immune responses, particularly in light of the observation that viral infections frequently modify host cell cycle progression. In the Discussion (lines 387-391), we now briefly propose a model wherein CDK2 activity during the S phase may suppress TBK1-mediated IFN production to allow viral replication, while CDK2 inhibition (e.g., in G1) may enhance IFN responses. This hypothesis will be the subject of our future work, including cell cycle synchronization experiments and time-course analyses of CDK2 activity and IFN output during infection.
Reviewer #1 (Recommendations for the authors):
(1) A control showing that the CDK2 inhibitor blocked kinase activity would be appropriate.
We thank the reviewer for this suggestion. We have performed experiments using the CDK2-specific inhibitor SNS-032. As shown in Author response image 1, treatment of EPC cells with SNS-032 (2 µM) still affects TBK1 expression. However, this inhibitor was selected on the basis of literature references (refs. 1 and 2), and it is uncertain whether it directly inhibits the kinase activity of CDK2. Nevertheless, our results demonstrate that CDK2 retains the capacity to degrade TBK1 even in the absence of its kinase domain (Fig. 6I), an outcome consistent with that obtained with this inhibitor.
Author response image 1.
References:
(1) Mechanism of action of SNS-032, a novel cyclin-dependent kinase inhibitor, in chronic lymphocytic leukemia. Blood. 2009 May 7;113(19):4637-45.
(2) SNS-032 is a potent and selective CDK 2, 7 and 9 inhibitor that drives target modulation in patient samples. Cancer Chemother Pharmacol. 2009 Sep;64(4):723-32.
eLife Assessment
The authors investigated the potential role of IgG N-glycosylation in Haemorrhagic Fever with Renal Syndrome (HFRS), which may offer significant insights for understanding molecular mechanisms and for the development of therapeutic strategies for this infectious disease. The findings are valuable to the field and the strength of evidence to support the findings is solid.
Reviewer #1 (Public review):
The authors investigated the potential role of IgG N-glycosylation in Haemorrhagic Fever with Renal Syndrome (HFRS), which may offer significant insights for understanding molecular mechanisms and for the development of therapeutic strategies for this infectious disease.
Reviewer #2 (Public review):
This work sought to explore antibody responses in the context of hemorrhagic fever with renal syndrome (HFRS) - a severe disease caused by Hantaan virus infection. Little is known about the characteristics or functional relevance of IgG Fc glycosylation in HFRS. To address this gap, the authors analyzed samples from 65 patients with HFRS spanning the acute and convalescent phases of disease via IgG Fc glycan analysis, scRNAseq, and flow cytometry. The authors observed changes in Fc glycosylation (increased fucosylation and decreased bisection) coinciding with a 4-fold or greater increase in Hantaan virus-specific antibody titer. The study also includes exploratory analyses linking IgG glycan profiles to glycosylation-related gene expression in distinct B cell subsets, using single-cell transcriptomics. Overall, this is an interesting study that combines serological profiling with transcriptomic data to shed light on humoral immune responses in an underexplored infectious disease. The integration of Fc glycosylation data with single-cell transcriptomic data is a strength.
Author response:
The following is the authors’ response to the previous reviews
Reviewer 1:
Summary:
The authors investigated the potential role of IgG N-glycosylation in Haemorrhagic Fever with Renal Syndrome (HFRS), which may offer significant insights for understanding molecular mechanisms and for the development of therapeutic strategies for this infectious disease.
While the majority of the issues have been addressed, a few minor points still remain unresolved. Quality control should be conducted prior to the analysis of clinical samples. However, the coefficient of variation (CV) value was not provided for the paired acute and convalescent-phase samples from 65 confirmed HFRS patients, which were analyzed to assess inter-individual biological variability. It is important to note that biological replication should be evaluated using general samples, such as standard serum.
We thank the reviewer for this insightful and critical comment regarding the quality control of our analytical data and the assessment of biological variability. We agree that this is essential for validating the reliability of our findings. We have now provided the requested CV data and clarified this point in the revised manuscript as detailed below.
"This dual-replicate strategy enabled a comprehensive evaluation of both biological heterogeneity and assay precision, and the coefficient of variation for samples were below 16%." Please see the Materials and Methods (Page 16, lines 360-362, and Author response table 1).
Author response table 1.
Comparative analysis of serum biomarker concentrations in acute and convalescent phase cohorts.
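For readers who wish to reproduce this kind of precision check, a minimal sketch of the coefficient of variation calculation follows. It assumes the CV is computed as the sample standard deviation divided by the mean, expressed as a percentage; the function name and the replicate values are hypothetical and are not taken from the study data.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m * 100.0


# Hypothetical replicate measurements of one glycan trait in a standard serum.
replicates = [10.0, 11.0, 9.0]
cv_percent = coefficient_of_variation(replicates)
```

A value computed this way can then be compared against the 16% bound quoted above.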
Reviewer 2:
This work sought to explore antibody responses in the context of hemorrhagic fever with renal syndrome (HFRS) - a severe disease caused by Hantaan virus infection. Little is known about the characteristics or functional relevance of IgG Fc glycosylation in HFRS. To address this gap, the authors analyzed samples from 65 patients with HFRS spanning the acute and convalescent phases of disease via IgG Fc glycan analysis, scRNAseq, and flow cytometry. The authors observed changes in Fc glycosylation (increased fucosylation and decreased bisection) coinciding with a 4-fold or greater increase in Hantaan virus-specific antibody titer. The study also includes exploratory analyses linking IgG glycan profiles to glycosylation-related gene expression in distinct B cell subsets, using single-cell transcriptomics. Overall, this is an interesting study that combines serological profiling with transcriptomic data to shed light on humoral immune responses in an underexplored infectious disease. The integration of Fc glycosylation data with single-cell transcriptomic data is a strength. The authors have addressed the major concerns from the initial review. However, one point to emphasize is that the data are correlative. While the associations between Fc glycosylation changes and recovery are intriguing, the evidence does not establish causation. This is not a weakness, as correlative studies can still be highly valuable and informative. However, the manuscript would be strengthened by making this distinction clear, particularly in the title.
The verb "accelerated" in the title implies that the glycosylation state of IgG was a direct driver of recovery, rather than something that correlated with recovery. Thus, a more neutral word/phrase would be ideal.
We sincerely thank the reviewer for this insightful suggestion. We agree that the use of "accelerated" might overstate the potential role of IgG glycosylation, which has not been clearly established by our current findings. As reported in the Results (particularly in Figure 2), partial glycosylation exhibits statistically significant variations between seropositive and seronegative statuses, before and after seroconversion, and across different HTNV-NP-specific antibody titers. Therefore, we have replaced "accelerated" with "contribute to" in the title: "Glycosylated IgG antibodies contribute to the recovery of haemorrhagic fever with renal syndrome patients".
eLife Assessment
This study presents a useful overview of the taxonomic composition of the microbiome associated with Dactylorhiza traunsteineri, a widely distributed orchid species in Central Europe. The evidence supporting the claims of the authors is incomplete, especially when it comes to the (secondary) metabolic pathways found in the metagenome assembled genomes, and requires more substantial analysis to be able to claim that these pathways play a key role in microbiome-orchid symbiosis.
Reviewer #1 (Public review):
Summary:
The microbiota of Dactylorhiza traunsteineri, an endangered marsh orchid, forms complex root associations that support plant health. Using 16S rRNA sequencing, we identified dominant bacterial phyla in its rhizosphere, including Proteobacteria, Actinobacteria, and Bacteroidota. Deep shotgun metagenomics revealed high-quality MAGs with rich metabolic and biosynthetic potential. This study provides key insights into root-associated bacteria and highlights the rhizosphere as a promising source of bioactive compounds, supporting both microbial ecology research and orchid conservation.
Strengths:
The manuscript presents an investigation of the bacterial communities in the rhizosphere of D. traunsteineri using advanced metagenomic approaches. The topic is relevant, and the techniques are up-to-date; however, the study has several critical weaknesses.
Weaknesses:
(1) Title: The current title is misleading. Given that fungi are the primary symbionts in orchids and were not analyzed in this study (nor were they included among other microbial groups), the use of the term "microbiome" is not appropriate. I recommend replacing it with "bacteriome" to better reflect the scope of the work.
(2) Line 124: The phrase "D. traunsteineri individuals were isolated" seems misleading. A more accurate description would be "individuals were collected", as also mentioned in line 128.
(3) Experimental design: The major limitation of this study lies in its experimental design. The number of plant individuals and soil samples analyzed is unclear, making it difficult to assess the statistical robustness of the findings. It is also not well explained why the orchids were collected two years before the rhizosphere soil samples. Was the rhizosphere soil collected from the same site and from remnants of the previously sampled individuals in 2018? This temporal gap raises serious concerns about the validity of the biological associations being inferred.
(4) Low sample size: In lines 249-251 (Results section), the authors mention that only one plant individual was used for identifying rhizosphere bacteria. This is insufficient to produce scientifically robust or generalizable conclusions.
(5) Contextual limitations: Numerous studies have shown that plant-microbe interactions are influenced by external biotic and abiotic factors, as well as by plant age and population structure. These elements are not discussed or controlled for in the manuscript. Furthermore, the ecological and environmental conditions of the site where the plants and soil were collected are poorly described. The number of biological and technical replicates is also not clearly stated.
(6) Terminology: Throughout the manuscript, the authors refer to the "microbiome," though only bacterial communities were analyzed. This terminology is inaccurate and should be corrected consistently.
Considering the issues addressed, particularly regarding experimental design and data interpretation, significant improvements to the study are needed.
Reviewer #2 (Public review):
Summary:
The authors aim to provide an overview of the D. traunsteineri rhizosphere microbiome on a taxonomic and functional level, through 16S rRNA amplicon analysis and shotgun metagenome analysis. The amplicon sequencing shows that the major phyla present in the microbiome belong to phyla with members previously found to be enriched in rhizospheres and bulk soils. Their shotgun metagenome analysis focused on producing metagenome assembled genomes (MAGs), of which one satisfies the MIMAG quality criteria for high-quality MAGs and three those for medium-quality MAGs. These MAGs were subjected to functional annotations focusing on metabolic pathway enrichment and secondary metabolic pathway biosynthetic gene cluster analysis. They find 1741 BGCs of various categories in the MAGs that were analyzed, with the high-quality MAG being claimed to contain 181 SM BGCs. The authors provide a useful, albeit superficial, overview of the taxonomic composition of the microbiome, and their dataset can be used for further analysis.
The conclusions of this paper are not well-supported by the data, as the paper only superficially discusses the results, and the functional interpretation based on taxonomic evidence or generic functional annotations does not allow drawing any conclusions on the functional roles of the orchid microbiota.
Weaknesses:
The authors only used one individual plant to take samples. This makes it hard to generalize about the natural orchid microbiome.
The authors use both 16S amplicon sequencing and shotgun metagenomics to analyse the microbiome. However, the authors barely discuss the similarities and differences between the results of these two methods, even though comparing these results may be able to provide further insights into the conclusions of the authors. For example, the relative abundance of the ASVs from the amplicon analysis is not linked to the relative abundances of the MAGs.
Furthermore, the authors discuss that phyla present in the orchid microbiome are also found in other microbiomes and are linked to important ecological functions. However, their results reach further than the phylum level, and a discussion of genera or even species is lacking. The phyla that were found have very large within-phylum functional variability, and reliable functional conclusions cannot be drawn based on taxonomic assignment at this level, or even the genus level (Yan et al. 2017).
Additionally, although the authors mention their techniques used, their method section is sometimes not clear about how samples or replicates were defined. There are also inconsistencies between the methods and the results section, for example, regarding the prediction of secondary metabolite biosynthetic gene clusters (BGCs).
The BGC prediction was done with several tools, and the unusually high number of found BGCs (181 in their high-quality MAG) is likely due to false positives or fragmented BGCs. The numbers are much higher than any numbers ever reported in literature supported by functional evidence (Amos et al, 2017), even in a prolific genus like Streptomyces (Belknap et al., 2020). This caveat is not discussed by the authors.
The authors have generated one high-quality MAG and three medium-quality MAGs. In the discussion, they present all four of these as high-quality, which could be misleading. The authors discuss what was found in the literature about the role of the bacterial genera/phyla linked to these MAGs in plant rhizospheres, but they do not sufficiently link their own analysis results (metabolic pathway enrichment and biosynthetic gene cluster prediction) to this discussion. The results of these analyses are only presented in tables without further explanation in either the results section or the discussion, even though there may be interesting findings. For example, the authors only discuss the class of the BGCs that were found, but don't search for experimentally verified homologs in databases, which could shed more light on the possible functional roles of BGCs in this microbiome.
In the conclusions, the authors state: "These analyses uncovered potential metabolic capabilities and biosynthetic potentials that are integral to the rhizosphere's ecological dynamics." I don't see any support for this. Mentioning that certain classes of BGCs are present is not enough to make this claim, in my opinion. Any BGC is likely important for the ecological niche the bacteria live in. The fact that rhizosphere bacteria harbour BGCs is not surprising, and it doesn't tell us more than is already known.
References:
Belknap, Kaitlyn C., et al. "Genome mining of biosynthetic and chemotherapeutic gene clusters in Streptomyces bacteria." Scientific reports 10.1 (2020): 2003
Amos GCA, Awakawa T, Tuttle RN, Letzel AC, Kim MC, Kudo Y, Fenical W, Moore BS, Jensen PR. Comparative transcriptomics as a guide to natural product discovery and biosynthetic gene cluster functionality. Proc Natl Acad Sci U S A. 2017 Dec 26;114(52):E11121-E11130.
Yan Yan, Eiko E Kuramae, Mattias de Hollander, Peter G L Klinkhamer, Johannes A van Veen, Functional traits dominate the diversity-related selection of bacterial communities in the rhizosphere, The ISME Journal, Volume 11, Issue 1, January 2017, Pages 56-66
Author response:
Reviewer #1 (Public review):
The microbiota of Dactylorhiza traunsteineri, an endangered marsh orchid, forms complex root associations that support plant health. Using 16S rRNA sequencing, we identified dominant bacterial phyla in its rhizosphere, including Proteobacteria, Actinobacteria, and Bacteroidota. Deep shotgun metagenomics revealed high-quality MAGs with rich metabolic and biosynthetic potential. This study provides key insights into root-associated bacteria and highlights the rhizosphere as a promising source of bioactive compounds, supporting both microbial ecology research and orchid conservation.
The manuscript presents an investigation of the bacterial communities in the rhizosphere of D. traunsteineri using advanced metagenomic approaches. The topic is relevant, and the techniques are up-to-date; however, the study has several critical weaknesses.
We thank the reviewer for their careful reading of our manuscript and for the constructive comments. We will revise the manuscript substantially. Our responses to the specific points are below:
(1) Title: The current title is misleading. Given that fungi are the primary symbionts in orchids and were not analyzed in this study (nor were they included among other microbial groups), the use of the term "microbiome" is not appropriate. I recommend replacing it with "bacteriome" to better reflect the scope of the work.
In the revised manuscript, we will expand the Results (shotgun sequencing) and Discussion to also include fungal taxa. With these additions, the use of the term microbiome will accurately reflect the inclusion of both bacterial and fungal components.
(2) Line 124: The phrase "D. traunsteineri individuals were isolated" seems misleading. A more accurate description would be "individuals were collected", as also mentioned in line 128.
This ambiguity will be corrected in the revised manuscript.
(3) Experimental design: The major limitation of this study lies in its experimental design. The number of plant individuals and soil samples analyzed is unclear, making it difficult to assess the statistical robustness of the findings. It is also not well explained why the orchids were collected two years before the rhizosphere soil samples. Was the rhizosphere soil collected from the same site and from remnants of the previously sampled individuals in 2018? This temporal gap raises serious concerns about the validity of the biological associations being inferred.
In the revised manuscript, we will explicitly state the number of individuals and soil samples included in the study, and we will more clearly describe the sequence of sampling events. We will also add a dedicated statement in the Discussion addressing the temporal gap between plant sampling and rhizosphere soil collection, acknowledging that this is a limitation of the study.
(4) Low sample size: In lines 249-251 (Results section), the authors mention that only one plant individual was used for identifying rhizosphere bacteria. This is insufficient to produce scientifically robust or generalizable conclusions.
In the revised manuscript, we will clearly state that only one rhizosphere sample was available and will frame the study as exploratory in nature. We will explicitly acknowledge this limitation in both the Methods and Discussion, and we will temper our conclusions accordingly.
(5) Contextual limitations: Numerous studies have shown that plant-microbe interactions are influenced by external biotic and abiotic factors, as well as by plant age and population structure. These elements are not discussed or controlled for in the manuscript. Furthermore, the ecological and environmental conditions of the site where the plants and soil were collected are poorly described. The number of biological and technical replicates is also not clearly stated.
In the revised manuscript, we will expand the description of the collection site and environmental conditions to the extent supported by our records. We will also clearly state the number of biological and technical replicates used for each analysis. In the Discussion, we will explicitly acknowledge that plant age, environmental variables, and other biotic/abiotic factors may influence plant–microbe interactions and were not directly assessed in this study.
(6) Terminology: Throughout the manuscript, the authors refer to the "microbiome," though only bacterial communities were analyzed. This terminology is inaccurate and should be corrected consistently.
As noted in our response to point (1), we will revise terminology throughout the manuscript to ensure consistency and to accurately reflect the expanded bacterial and fungal coverage in the revised version.
Reviewer #2 (Public review):
The authors aim to provide an overview of the D. traunsteineri rhizosphere microbiome on a taxonomic and functional level, through 16S rRNA amplicon analysis and shotgun metagenome analysis. The amplicon sequencing shows that the major phyla present in the microbiome belong to phyla with members previously found to be enriched in rhizospheres and bulk soils. Their shotgun metagenome analysis focused on producing metagenome assembled genomes (MAGs), of which one satisfies the MIMAG quality criteria for high-quality MAGs and three those for medium-quality MAGs. These MAGs were subjected to functional annotations focusing on metabolic pathway enrichment and secondary metabolic pathway biosynthetic gene cluster analysis. They find 1741 BGCs of various categories in the MAGs that were analyzed, with the high-quality MAG being claimed to contain 181 SM BGCs. The authors provide a useful, albeit superficial, overview of the taxonomic composition of the microbiome, and their dataset can be used for further analysis.
The conclusions of this paper are not well-supported by the data, as the paper only superficially discusses the results, and the functional interpretation based on taxonomic evidence or generic functional annotations does not allow drawing any conclusions on the functional roles of the orchid microbiota.
We thank the reviewer for their thoughtful and constructive assessment of our manuscript. The comments have been very helpful in identifying areas where the clarity, structure, and interpretation of our work can be improved. Our responses to the specific points are below:
(1) The authors only used one individual plant to take samples. This makes it hard to generalize about the natural orchid microbiome.
We agree with the reviewer that the limited number of plant individuals restricts the generality of the conclusions. In the revised manuscript, we will clearly state that only one rhizosphere sample was available for analysis and will frame the study as exploratory. We will also explicitly acknowledge this limitation in the Discussion and ensure that our interpretations and conclusions remain appropriately cautious.
(2) The authors use both 16S amplicon sequencing and shotgun metagenomics to analyse the microbiome. However, the authors barely discuss the similarities and differences between the results of these two methods, even though comparing these results may be able to provide further insights into the conclusions of the authors. For example, the relative abundance of the ASVs from the amplicon analysis is not linked to the relative abundances of the MAGs.
In the revised manuscript, we will expand the Results and Discussion to include a clearer comparison between the taxonomic profiles derived from 16S amplicon sequencing and those obtained from shotgun metagenomic binning.
(3) Furthermore, the authors discuss that phyla present in the orchid microbiome are also found in other microbiomes and are linked to important ecological functions. However, their results reach further than the phylum level, and a discussion of genera or even species is lacking. The phyla that were found have very large within-phylum functional variability, and reliable functional conclusions cannot be drawn based on taxonomic assignment at this level, or even the genus level (Yan et al. 2017).
In the revised manuscript, we will incorporate taxonomic discussion at finer resolution where reliable assignments are available. We will also revise the Discussion to avoid overinterpreting phylum-level taxonomy in terms of ecological function.
(4) Additionally, although the authors mention their techniques used, their method section is sometimes not clear about how samples or replicates were defined. There are also inconsistencies between the methods and the results section, for example, regarding the prediction of secondary metabolite biosynthetic gene clusters (BGCs).
In the revised Methods section, we will clearly define the number and type of samples included in each analysis, specify the number of replicates and how they were handled, and provide a clearer description of the biosynthetic gene cluster (BGC) prediction workflow, including the tools used and how results were interpreted.
(5) The BGC prediction was done with several tools, and the unusually high number of found BGCs (181 in their high-quality MAG) is likely due to false positives or fragmented BGCs. The numbers are much higher than any numbers ever reported in literature supported by functional evidence (Amos et al, 2017), even in a prolific genus like Streptomyces (Belknap et al., 2020). This caveat is not discussed by the authors.
We thank the reviewer for this important point. Our original intention was to present the BGC predictions as a resource for future exploration, which is why multiple tools were used. However, we understand how this approach may lead to confusion, particularly regarding the confidence level of the predicted clusters and the potential inflation of counts due to assembly fragmentation or tool sensitivity. In the revised manuscript, we will thoroughly revise this section to clearly distinguish high-confidence predictions from more exploratory findings. We will focus on results supported by stronger evidence, explicitly qualify lower-confidence predictions as putative, and temper any functional interpretations accordingly.
(6) The authors have generated one high-quality MAG and three medium-quality MAGs. In the discussion, they present all four of these as high-quality, which could be misleading. The authors discuss what was found in the literature about the role of the bacterial genera/phyla linked to these MAGs in plant rhizospheres, but they do not sufficiently link their own analysis results (metabolic pathway enrichment and biosynthetic gene cluster prediction) to this discussion. The results of these analyses are only presented in tables without further explanation in either the results section or the discussion, even though there may be interesting findings. For example, the authors only discuss the class of the BGCs that were found, but don't search for experimentally verified homologs in databases, which could shed more light on the possible functional roles of BGCs in this microbiome.
In the revised manuscript, we will ensure that MAG quality is described accurately and consistently throughout, distinguishing clearly between high-quality and medium-quality bins according to accepted standards.
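For reference, the distinction between high-quality and medium-quality bins can be sketched as a small classifier. The thresholds below follow the published MIMAG draft-genome categories as we understand them (completeness, contamination, rRNA and tRNA content); the function name and argument names are illustrative only and do not correspond to any tool used in the manuscript.

```python
def mimag_quality(completeness, contamination, has_all_rrnas=False, n_trnas=0):
    """Classify a MAG by MIMAG-style thresholds (percent values expected).

    Illustrative sketch: argument names are hypothetical, and the thresholds
    should be checked against the MIMAG standard before any real use.
    """
    # High-quality draft: >90% complete, <5% contaminated, with 5S, 16S and
    # 23S rRNA genes present and at least 18 tRNAs.
    if completeness > 90 and contamination < 5 and has_all_rrnas and n_trnas >= 18:
        return "high"
    # Medium-quality draft: >=50% complete and <10% contaminated.
    if completeness >= 50 and contamination < 10:
        return "medium"
    # Low-quality draft: <50% complete but still <10% contaminated.
    if contamination < 10:
        return "low"
    return "fails contamination threshold"
```

Note that under these rules a highly complete bin without the rRNA and tRNA complement is still only medium-quality, which is one common reason for a bin to be reported in the wrong category.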
(7) In the conclusions, the authors state: "These analyses uncovered potential metabolic capabilities and biosynthetic potentials that are integral to the rhizosphere's ecological dynamics." I don't see any support for this. Mentioning that certain classes of BGCs are present is not enough to make this claim, in my opinion. Any BGC is likely important for the ecological niche the bacteria live in. The fact that rhizosphere bacteria harbour BGCs is not surprising, and it doesn't tell us more than is already known.
In the revised manuscript, we will rewrite the conclusion to reflect a more cautious interpretation, focusing on the potential metabolic and biosynthetic capabilities suggested by the data without asserting ecological roles that cannot be directly supported. These capabilities will be presented as hypotheses for future investigation rather than established ecological features.
eLife Assessment
This manuscript makes a valuable contribution to the concept of fragility of meta-analyses via the so-called 'ellipse of insignificance for meta-analyses' (EOIMETA). The strength of evidence is solid, supported primarily by an example of the fragility of meta-analyses in the association between Vitamin D supplementation and cancer mortality, but the approach could be applied in other meta-analytic contexts. The significance of the work could be enhanced with a more thorough assessment of the impact of between-study heterogeneity, additional case studies, and improved contextualization of the proposed approach in relation to other methods.
Reviewer #1 (Public review):
Summary:
This manuscript addresses an important methodological issue - the fragility of meta-analytic findings - by extending fragility concepts beyond trial-level analysis. The proposed EOIMETA framework provides a generalizable and analytically tractable approach that complements existing methods such as the traditional Fragility Index and Atal et al.'s algorithm. The findings are significant in showing that even large meta-analyses can be highly fragile, with results overturned by very small numbers of event recodings or additions. The evidence is clearly presented, supported by applications to vitamin D supplementation trials, and contributes meaningfully to ongoing debates about the robustness of meta-analytic evidence. Overall, the strength of evidence is moderate to strong, though some clarifications would further enhance interpretability.
Strengths:
(1) The manuscript tackles a highly relevant methodological question on the robustness of meta-analytic evidence.
(2) EOIMETA represents an innovative extension of fragility concepts from single trials to meta-analyses.
(3) The applications are clearly presented and highlight the potential importance of fragility considerations for evidence synthesis.
Weaknesses:
(1) The rationale and mathematical details behind the proposed EOI and ROAR methods are insufficiently explained. Readers are asked to rely on external sources (Grimes, 2022; 2024b) without adequate exposition here. At a minimum, the definitions, intuition, and key formulas should be summarized in the manuscript to ensure comprehensibility.
(2) EOIMETA is described as being applicable when heterogeneity is low, but guidance is missing on how to interpret results when heterogeneity is high (e.g., large I²). Clarification in the Results/Discussion is needed, and ideally, a simulation or illustrative example could be added.
(3) The manuscript would benefit from side-by-side comparisons between the traditional FI at the trial level and EOIMETA at the meta-analytic level. This would contextualize the proposed approach and underscore the added value of EOIMETA.
(4) Scope of FI: The statement that FI applies only to binary outcomes is inaccurate. While originally developed for dichotomous endpoints, extensions exist (e.g., Continuous Fragility Index, CFI). The manuscript should clarify that EOIMETA focuses on binary outcomes, but FI, as a concept, has been generalized.
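To make the heterogeneity point in (2) concrete, the I² statistic can be computed from Cochran's Q under a fixed-effect, inverse-variance model. The sketch below is a generic textbook calculation, not part of EOIMETA; the function name and the example inputs are hypothetical.

```python
def i_squared(effects, variances):
    """Return (pooled effect, Cochran's Q, I^2 in percent).

    effects   : per-study effect estimates (for example log risk ratios)
    variances : the corresponding within-study variances
    """
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is truncated at zero when Q does not exceed its degrees of freedom.
    i2 = 0.0 if q <= df else 100.0 * (q - df) / q
    return pooled, q, i2
```

For two hypothetical studies with effects 0.1 and 0.5 and equal variances of 0.01, Q is 8 on one degree of freedom and I² is 87.5%, a level at which the low-heterogeneity assumption would clearly not hold.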
Reviewer #2 (Public review):
Summary:
The study expands existing analytical tools originally developed for randomized controlled trials with dichotomous outcomes to assess the potential impact of missing data, adapting them for meta-analytical contexts. These tools evaluate how missing data may influence meta-analyses where p-value distributions cluster around significance thresholds, often leading to conflicting meta-analyses addressing the same research question. The approach quantifies the number of recodings (adding events to the experimental group and/or removing events from the control group) required for a meta-analysis to lose or gain statistical significance. The author developed an R package to perform fragility and redaction analyses and to compare these methods with a previously established approach by Atal et al. (2019), also integrated into the package. Overall, the study provides valuable insights by applying existing analytical tools from randomized controlled trials to meta-analytical contexts.
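For orientation, the idea of counting recodings can be illustrated with the classic trial-level Fragility Index, which these tools generalise. The sketch below is a minimal standard-library version for a single two-arm trial, assuming a two-sided Fisher exact test and a 0.05 threshold; it is not EOIMETA, ROAR, or the method of Atal et al., and the function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Count non-event-to-event recodings needed to lose significance.

    Recodings are applied to the arm with fewer events; a result that is
    not significant to begin with returns 0.
    """
    events = [events_a, events_b]
    sizes = [n_a, n_b]
    arm = 0 if events_a <= events_b else 1
    flips = 0
    while events[arm] < sizes[arm] and fisher_exact_two_sided(
        events[0], sizes[0] - events[0], events[1], sizes[1] - events[1]
    ) < alpha:
        events[arm] += 1
        flips += 1
    return flips
```

As a worked case, a hypothetical trial with 0/10 events in one arm and 5/10 in the other is significant (p ≈ 0.033), but recoding a single participant gives p ≈ 0.14, so its Fragility Index is 1.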
Strengths:
The author's results support his claims. Analyzing the fragility of a given meta-analysis could be a valuable approach for identifying early signs of fragility within a specific topic or body of evidence. If fragility is detected alongside results that hover around the significance threshold, adjusting the significance cutoff as a function of sample size should be considered before making any binary decision regarding statistical significance for that body of evidence. Although the primary goal of meta-analysis is effect estimation, conclusions often still rely on threshold-based interpretations, which is understandable. In some of the examples presented by Atal et al. (2019), the event recoding required to shift a meta-analysis from significant to non-significant (or vice versa) produced only minimal changes in the effect size estimation. Therefore, in bodies of evidence where meta-analyses are fragile or where results cluster near the null, it may be appropriate to adjust the cutoff. Conducting such analyses, that is, identifying fragility early and adapting thresholds accordingly, could help flag fragile bodies of evidence and prevent future conflicting meta-analyses on the same question, thereby reducing research waste and improving reproducibility.
Weaknesses:
It would be valuable to include additional bodies of conflicting literature in which meta-analyses have demonstrated fragility. This would allow for a more thorough assessment of the consistency of these analytical tools, their differences, and whether this particular body of literature favored one methodology over another. The method proposed by Atal et al. was applied to numerous meta-analyses and demonstrated consistent performance. I believe there is room for improvement, as both the EOI and ROAR appear to be very promising tools for identifying fragility in meta-analytical contexts.
I believe the manuscript should be improved in terms of reporting, with clearer statements of the study's and methods' limitations, and by incorporating additional bodies of evidence to strengthen its claims.
Reviewer #3 (Public review):
Summary and strengths:
In this manuscript, Grimes presents an extension of the Ellipse of Insignificance (EOI) and Region of Attainable Redaction (ROAR) metrics to the meta-analysis setting as metrics for fragility and robustness evaluation of meta-analysis. The author applies these metrics to three meta-analyses of Vitamin D and cancer mortality, finding substantial fragility in their conclusions. Overall, I think this extension/adaptation is a conceptually valuable addition to meta-analysis evaluation, and the manuscript is generally well-written.
Specific comments:
(1) The manuscript would benefit from a clearer explanation of in what sense EOIMETA is generalizable. The author mentions this several times, but without a clear explanation of what they mean here.
(2) The authors mentioned the proposed tools assume low between-study heterogeneity. Could the author illustrate mathematically in the paper how the between-study heterogeneity would influence the proposed measures? Moreover, the between-study heterogeneity is high in Zhang et al's 2022 study. It would be a good place to comment on the influence of such high heterogeneity on the results, and specifying a practical heterogeneity cutoff would better guide future users.
(3) I think clarifying the concepts of "small effect", "fragile result", and "unreliable result" would be helpful for preventing misinterpretation by future users. I am concerned that the audience may be confusing these concepts. A small effect may be related to a fragile meta-analysis result. A fragile meta-analysis doesn't necessarily mean wrong/untrustworthy results. A fragile but precise estimate can still reflect a true effect, but whether that size of true effect is clinically meaningful is another question. Clarifying the effect magnitude, fragility, and reliability in the discussion would be helpful.
eLife Assessment
In the Gram-positive model organism Bacillus subtilis, the membrane-associated ParA family member MinD concentrates the division inhibitor MinC at cell poles, where it prevents aberrant division events. This important study presents compelling data suggesting that polar localization of MinCD is largely due to differences in diffusion rates between monomeric and dimeric MinD. This finding is exciting as it negates the necessity for a third localization determinant in this system, as has been proposed by previous investigations.
Reviewer #1 (Public review):
The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V) and ATP-bound dimer (D40A, hydrolysis defective), and compared to wild-type MinD. Overall, the experimental results support the conclusions that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but the binding affinities between monomers and dimers are similar.
The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future.
Reviewer #3 (Public review):
This important study by Bohorquez et al examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole to pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis, which does not encode a MinE analog, has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles, although MinJ may help stabilize the MinCD complex at those locations.
Comments on revisions:
I believe the authors put respectable effort into revisions and addressing reviewer comments, particularly those that focused on the strengths of the original conclusions. The language in the current version of the manuscript is more precise and the overall product is stronger.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V), and ATP-bound dimer (D40A, hydrolysis defective), and compared to wild-type MinD. Overall, the experimental results support the conclusion that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but that the binding affinities between monomers and dimers are similar.
The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future. However, the current story is sufficient without testing these assumptions or predictions.
Reviewer #2 (Public review):
Summary:
Bohorquez et al. investigate the molecular determinants of intracellular gradient formation in the B. subtilis Min system. To this end, they generate B. subtilis strains that express MinD mutants that are locked in the monomeric or dimeric states, and also MinD mutants with amphipathic helices of varying membrane affinity. They then assess the mutants' ability to bind to the membrane and form gradients using fluorescence microscopy in different genetic backgrounds. They find that, unlike in the E. coli Min system, the monomeric form of MinD is already capable of membrane binding. They also show that MinJ is not required for MinD membrane binding and only interacts with the dimeric form of MinD. Using kinetic Monte Carlo simulations, the authors then test different models for gradient formation, and find that a MinD gradient along the cell axis is only formed when the polarly localized protein MinJ stimulates dimerization of MinD, and when the diffusion rate of monomeric and dimeric MinD differs. They also show that differences in the membrane affinity of MinD monomers and dimers are not required for gradient formation.
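The mechanism summarised above can be illustrated with a toy one-dimensional lattice simulation. The sketch below is not the authors' kinetic Monte Carlo model: it ignores membrane binding and cell geometry, and every parameter value and function name is a hypothetical choice made for illustration. Particles hop along the cell axis, monomers dimerize only within a few sites of either pole (standing in for MinJ-stimulated dimerization), and dimers revert to monomers anywhere.

```python
import random


def simulate(p_hop_dimer, n_particles=200, length=50, pole_width=3,
             k_dimerize=0.5, k_split=0.01, burn_in=1500, sample_steps=1500,
             sample_every=10, seed=1):
    """Return time-averaged occupancy counts per lattice site.

    Monomers attempt a hop every step; dimers hop with probability
    p_hop_dimer, so p_hop_dimer < 1 means dimers diffuse more slowly.
    """
    rng = random.Random(seed)
    pos = [rng.randrange(length) for _ in range(n_particles)]
    is_dimer = [False] * n_particles
    counts = [0] * length
    for step in range(burn_in + sample_steps):
        for i in range(n_particles):
            p_hop = p_hop_dimer if is_dimer[i] else 1.0
            if rng.random() < p_hop:
                new = pos[i] + rng.choice((-1, 1))
                if 0 <= new < length:      # reflecting ends: rejected hop stays put
                    pos[i] = new
            x = pos[i]
            if is_dimer[i]:
                if rng.random() < k_split:            # dimer -> monomer anywhere
                    is_dimer[i] = False
            elif (x < pole_width or x >= length - pole_width) \
                    and rng.random() < k_dimerize:    # monomer -> dimer at poles only
                is_dimer[i] = True
        if step >= burn_in and (step - burn_in) % sample_every == 0:
            for x in pos:
                counts[x] += 1
    return counts


def polar_enrichment(counts, polar_sites=5):
    """Mean density in the outer sites divided by mean density in the centre."""
    polar = sum(counts[:polar_sites]) + sum(counts[-polar_sites:])
    centre = sum(counts[polar_sites:-polar_sites])
    n_polar = 2 * polar_sites
    return (polar / n_polar) / (centre / (len(counts) - n_polar))
```

Comparing `polar_enrichment(simulate(0.05))` with `polar_enrichment(simulate(1.0))` shows the qualitative point: when dimers hop as fast as monomers, a particle's position is independent of its state in this toy model, so the total distribution stays close to uniform even though dimerization is still restricted to the poles, whereas slow dimers accumulate near the poles.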
Strengths:
The paper offers a comprehensive collection of the subcellular localization and gradient formation of various MinD mutants in different genetic backgrounds. In particular, the comparison of the localization of these mutants in a delta MinC and MinJ background offers valuable additional insights. For example, they find that only dimeric MinD can interact with MinJ. They also provide evidence that MinD locked in a dimer state may co-polymerize with MinC, resulting in a speckled appearance.
The authors introduce and verify a useful measure of membrane affinity in vivo.
The modulation of the membrane affinity by using distinct amphipathic helices highlights the robustness of the B. subtilis MinD system, which can form gradients even when the membrane affinity of MinD is increased or decreased.
Weaknesses:
The main claim of the paper, that differences in the membrane affinity between MinD monomers and dimers are not required for gradient formation, does not seem to be supported by the data. The only measure of membrane affinity presented is extracted from the transverse fluorescence intensity profile of cells expressing the mGFP-tagged MinD mutants. The authors measure the valley-to-peak ratio of the profile, which is lower than 1 for proteins binding to the membrane and higher than 1 for cytosolic proteins. To verify this measure of membrane affinity, they use a membrane dye and a soluble GFP, which results in values of ~0.75 and ~1.25, respectively. They then show that all MinD mutants have a value - roughly in the range of 0.8-0.9 - and they use this to claim that there are no differences in membrane affinity between monomeric and dimeric versions.
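To make the measure under discussion concrete, one plausible reading of the valley-to-peak ratio is sketched below. This is an assumption about how such a ratio could be computed, not the authors' analysis pipeline: the "valley" is taken as the intensity at mid-cell and the "peak" as the mean intensity at the two membrane positions. The function name, arguments, and the synthetic profiles in the usage note are hypothetical.

```python
def valley_to_peak_ratio(profile, left_edge, right_edge):
    """Ratio of mid-cell intensity to mean intensity at the two cell edges.

    profile    : intensities sampled along a line across the cell width
    left_edge  : index of the left membrane position in the profile
    right_edge : index of the right membrane position in the profile
    """
    centre = (left_edge + right_edge) // 2
    valley = profile[centre]                                # mid-cell signal
    peak = (profile[left_edge] + profile[right_edge]) / 2   # membrane signal
    return valley / peak
```

With a synthetic membrane-like profile such as `[1, 10, 7.5, 10, 1]` and edges at indices 1 and 3, the ratio is 0.75; a synthetic cytosolic profile such as `[1, 8, 10, 8, 1]` gives 1.25. A single ratio of this kind compresses the whole profile into one number, which is why its sensitivity to modest affinity differences is a fair question.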
While this way to measure membrane affinity is useful to distinguish between binders and non-binders, it is unclear how sensitive this assay is, and whether it can resolve more subtle differences in membrane affinity, beyond the classification into binders and non-binders. A dimer with two amphipathic helices should have a higher membrane affinity than a monomer with only one such copy. Thus, the data does not seem to support the claim that "the different monomeric mutants have the same membrane affinity as the wildtype MinD". The data only supports the claim that B. subtilis MinD monomers already have a measurable membrane affinity, which is indeed a difference from the E. coli Min system.
While their data does show that a stark difference between monomer and dimer membrane affinity may not be required for gradient formation in the B. subtilis case, it is also not prevented if the monomer is unable to bind to the membrane. They show this by replacing the native MinD amphipathic helix with the weak amphipathic helix NS4AB-AH. According to their membrane affinity assay, NS4AB-AH does not bind to the membrane as a monomer (Figure 4D), but when this helix is fused to MinD, MinD is still capable of forming a gradient (albeit a weaker one). Since the authors make a direct comparison to the E. coli MinDE systems, they could have used the E. coli MinD MTS instead of or in addition to the NS4AB-AH amphipathic helix. The reviewer suspects that a fusion of the E. coli MinD MTS to B. subtilis MinD may also support gradient formation.
The paper contains insufficient data to support the many claims about cell filamentation and minicell formation. In many cases, statements like "did not result in cell filamentation" or "restored cell division" are only supported by a single fluorescence image instead of a quantitative analysis of cell length distribution and minicell frequency, as the one reported for a subset of the data in Figure 5.
The paper would also benefit from a quantitative measure of gradient formation of the distinct MinD mutants, instead of relying on individual fluorescent intensity profiles.
The authors compare their experimental results with the oscillating E. coli MinDE system and use it to define some of the rules of their Monte Carlo simulation. However, the description of the E. coli Min system is sometimes misleading or based on outdated findings.
The Monte Carlo simulation of the gradient formation in B. subtilis could benefit from a more comprehensive approach:
(1) While most of the initial rules underlying the simulation are well justified, the authors do not implement or test two key conditions:
(a) Cooperative membrane binding, which is a key component of mathematical models for the oscillating E. coli Min system. This cooperative membrane binding has recently been attributed to MinD or MinCD oligomerization on the membrane and has been experimentally observed in various instances; in fact, the authors themselves show data supporting the formation of MinCD copolymers.
(b) Local stimulation of the ATPase activity of MinD, which triggers the dimer-to-monomer transition; E. coli MinD ATP hydrolysis is stimulated by the membrane and by MinE, so B. subtilis MinD may also be stimulated by the membrane and/or other components like MinJ.
Instead, the authors claim that (a) would only increase differences in diffusion between the monomer and different oligomeric species, and that a 2-fold increase in dimerization on the membrane could not induce gradient formation in their simulation, in the absence of MinJ stimulating gradient formation. However, a 2-fold increase in dimerization is likely way too low to explain any cooperative membrane binding observed for the E. coli Min system. Regarding (b), they also claim that implementing stimulation of ATP hydrolysis on the membrane (dimer-to-monomer transition) would not change the outcome, but no simulation result for this condition is actually shown.
(2) To generate any gradient formation, the authors claim that they would need to implement stimulation of dimer formation by MinJ, but they themselves acknowledge the lack of any experimental evidence for this assertion. They then test all other conditions (e.g., differences in membrane affinity, diffusion, etc.) in addition to the requirement that MinJ stimulates dimer formation. It is unclear whether the authors tested all other conditions independently of the "MinJ induces dimerization" condition, and whether either of those alone or in combination could also lead to gradient formation. This would be an important test to establish the validity of their claims.
Reviewer #3 (Public review):
This important study by Bohorquez et al examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole to pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis - which does not encode a MinE analog - has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to the concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles, although MinJ may help stabilize the MinCD complex at those locations.
Reviewer #1 (Recommendations for the authors):
(1) The title could be modified to better reflect the emphasis on MinD monomer and dimer diffusion rather than the fact that membrane affinity is not important in MinD gradient formation. In addition, because membrane association requires affinity for the membrane, this title seems inconsistent with statements in the main text, such as Lines 246-247: a reversible membrane association is important for the formation of a MinD gradient along the cell axis.
We agree with the reviewer that the title can be more accurate, and we have now changed it to “Membrane affinity difference between MinD monomer and dimer is not crucial to MinD gradient formation in Bacillus subtilis”
(2) This paper reports that the difference in diffusion rates between MinD monomers and dimers is an important factor in the formation of Bs MinD gradients. However, one can argue for the importance of MinD monomers in the cellular context. Since the abundance of ATP in cells often far exceeds the abundance of MinD protein molecules under experimental conditions, MinD can easily form dimers in the cytoplasm. How does the author address this problem?
It is a good point that ATP concentration in the cell likely favours dimers in the cytoplasm. However, what is important in our model is that there is cycling between monomer and dimer, rather than where exactly this happens. In fact, the gradient works essentially equally well if dimers can become monomers only whilst they are at the membrane, as we have mentioned in the manuscript (lines 324-326 in the original manuscript). However, in the original manuscript this simulation was not shown, and now we have included this in the new Fig. 8D & E.
(3) The claim "This oscillating gradient requires cycling of MinD between a monomeric cytosolic and a dimeric membrane attached state." (Lines 46, 47) is not well supported by most current studies and needs to be revised since to my knowledge, most proposed models do not consider the monomer state. The basic reaction steps of Ec Min oscillations include ATP-bound MinD dimers attaching to the membrane that subsequently recruit more MinD dimers and MinE dimers to the membrane; MinE interactions stimulate ATP hydrolysis in MinD, leading to dissociation of ADP-bound MinD dimers from the membrane; nucleotide exchange occurs in the cytoplasm.
Here the reviewer refers to a sentence in a short “Importance” abstract that we have added. In fact, such abstract is not necessary, so we have removed it. Of note, the E. coli MinD oscillation, including the role of MinE, is described in detail in the Introduction.
A recent reference is a paper by Heermann et al. (2020; doi: 10.1016/j.jmb.2020.03.012), which considers the MinD monomer state, which is not mentioned in this work. How do their observations compare to this work?
The Heermann paper mentions that MinD bound to the membrane displays an interface for multimerization, and that this contributes to the local self-enhancement of MinD at the membrane. In our Discussion, we do mention that E. coli MinD can form polymers in vitro and that any multimerization of MinD dimers will further increase the diffusion difference between monomer and dimer, and might contribute to the formation of a protein gradient (lines 459-467). We have now included a reference to the Heermann paper (line 461).
(4) Throughout the manuscript, errors in citing references were found in several places.
We have corrected this where suggested.
(5) The introduction may be somewhat misleading due to mixed information from experimental cellular results, in vitro reconstructions, and theoretical models in cells or in vitro environments. Some models consider space constraints, while others do not. Modifications are recommended to clarify differences.
See below for responses
(6) The citation for MinD monomers:
The paper by Hu and Lutkenhaus (2003, doi: 10.1046/j.1365-2958.2003.03321.x.) contains experimental evidence showing monomer-dimer transition using purified proteins. Another paper by the same laboratory (Park et al. 2012, doi: 10.1111/j.1365-2958.2012.08110.x.) explained ATP-induced dimerization, but this paper is not cited.
The Park et al. 2012 paper focuses on the asymmetric activation of MinD ATPase by MinE, which goes beyond the scope of our work. However, we have cited several other papers from the Lutkenhaus lab, including the Wu et al. 2011 paper describing the structure of the MinD-ATP complex.
Other evidence comes from structural studies of Archaea Pyrococcus furiosus (1G3R) and Pyrococcus horikoshii (1ION), and thermophilic Aquifex aeolicus (4V01, 4V02, 4V03). As they may function differently from Ec MinD, they are less relevant to this manuscript.
We agree.
(7) Lines 65, 66: Using the term 'a reaction-diffusion couple' to describe the biochemical facts by citing references of Hu and Lutkenhaus (1999) and Raskin and de Boer (1999) is not appropriate. The idea that the Min system behaves as a reaction-diffusion system was started by Howard et al. (2001), Meinhardt and de Boer (2001), and Huang et al. (2003), among others. In addition, references for MinE oscillation are missing.
We have now corrected this (line 52).
(8) Lines 77-79: Citations are incorrect.
ATP-induced dimerization: Hu and Lutkenhaus (2003, DOI: 10.1046/j.1365-2958.2003.03321.x), Park et al. (2012). C-terminal amphipathic helix formation: Szeto et al. (2003), Hu and Lutkenhaus (2003, DOI: 10.1046/j.1365-2958.2003.03321.x).
Citations have been corrected.
(9) Line 78: The C-terminal amphipathic helix is not pre-formed and then exposed upon conformational change induced by ATP-binding. This alpha-helical structure is an induced fold upon interaction with membranes as experimentally demonstrated by Szeto et al. (2003).
We have adjusted the text to correct this (lines 64-66).
(10) Line 102: 'cycles between membrane association and dissociation of MinD' also requires MinE in addition to ATP.
We believe that in the context of this sentence and following paragraph it is not necessary to again mention MinE, since it is focused on parallels between the E. coli and B. subtilis MinD membrane binding cycles.
(11) In the introduction, could the author briefly explain to a general audience the difference between Monte Carlo and reaction-diffusion methods? How do different algorithms affect the results?
The main difference between the kinetic Monte Carlo and typical reaction-diffusion methods which is relevant to our work is that the first is particle-based, and naturally includes statistical fluctuations (noise), whereas the second is field-based, and is in the normal implementation deterministic, so does not include noise. Whilst it should be noted that one can in principle include noise in the field-based reaction-diffusion methods, this is rarely done. Additionally, although we do not do this here, the kinetic Monte Carlo can also account, in principle, for particle shape (sphere versus rod), or for localized interactions (such as sticky patches on the surface): therefore the kinetic Monte Carlo is more microscopic in nature. We have now briefly described the difference in lines 102-105.
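The distinction drawn above can be illustrated with a minimal sketch. It is not the algorithm used in the manuscript: it models diffusion only (no reactions, no membrane, no MinD states) on a one-dimensional lattice, and every name and parameter value in it is an assumption chosen for illustration. The particle-based version moves individual walkers at random and therefore carries noise; the field-based version updates a concentration profile deterministically.

```python
import random


def particle_step(positions, n_sites, rng):
    """Move each particle one site left or right at random (reflecting ends)."""
    new_positions = []
    for x in positions:
        x_new = x + rng.choice((-1, 1))
        if x_new < 0 or x_new >= n_sites:
            x_new = x  # move across the boundary is rejected
        new_positions.append(x_new)
    return new_positions


def field_step(conc, d):
    """One explicit finite-difference diffusion step with no-flux ends.

    d stands for D*dt/dx^2 and should not exceed 0.5 for stability.
    """
    n = len(conc)
    new = conc[:]
    for i in range(n):
        left = conc[i - 1] if i > 0 else conc[i]
        right = conc[i + 1] if i < n - 1 else conc[i]
        new[i] = conc[i] + d * (left - 2 * conc[i] + right)
    return new


def run_particles(n_particles, n_sites, n_steps, seed):
    """Particle-based run: returns the number of particles on each site."""
    rng = random.Random(seed)
    positions = [n_sites // 2] * n_particles  # all particles start mid-lattice
    for _ in range(n_steps):
        positions = particle_step(positions, n_sites, rng)
    counts = [0] * n_sites
    for x in positions:
        counts[x] += 1
    return counts


def run_field(total, n_sites, n_steps, d=0.25):
    """Field-based run: returns the concentration on each site."""
    conc = [0.0] * n_sites
    conc[n_sites // 2] = float(total)  # same initial condition as above
    for _ in range(n_steps):
        conc = field_step(conc, d)
    return conc
```

Running `run_particles` with different seeds gives histograms that scatter around a smooth profile, whereas `run_field` returns the same smooth profile on every call. That difference in noise is the property the response refers to.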
(12) Lines 126-128: The second part of the sentence uses the protein structure of Pyrococcus furiosus MinD (Ref 37) to support a protein sequence comparison between Ec and Bs MinD. However, the structure of the dimeric E. coli MinD-ATP complex (3Q9L) is available, which is Reference 38 that is more suited for direct comparison.
To discuss monomeric MinD from P. furiosus, it will be useful to include it in the primary sequence alignment in Figure S1.
We do not think that this detailed information is necessary to add to Figure S1, since the mutants have been described before (appropriate citations present in the text).
(13) Lines 127, 166: Where Figure S1 is discussed, a structural model of MinD will be useful alongside with the primary sequence alignment.
We do not think that this detailed information is necessary to understand the experiments since the mutants have been described before.
(14) Lines 131-132: Reference is missing for the sentence "the conserved..."; Reference 38. In Reference 38, there is no experimental evidence on G12 but inferred from structure analysis. Reference 26 discusses ATP and MinE regulation on the interactions between MinD and phospholipid bilayers; not about MinD dimerization.
We have corrected this and added the proper references.
For easy reading, the mutant MinD phenotypes can be indicated here instead of in the figure legends, including K16A (apo monomer), MinD G12V (ATP-bound monomer), and MinD D40A (ATP-bound dimer, ATP hydrolysis deficient).
We have added the suggested descriptions of the mutants in the main text.
(15) Lines 150-151: Unlike Ec MinD, which forms a clear gradient in one half of the cell, Bs MinD (wild type) mainly accumulates at the hemispheric poles. What percentage of a cell (or cell length) can be covered by the Bs MinD gradient? How does the shaded area in the longitudinal FIP compare to the area of the bacterial hemispherical pole? If possible, it might be interesting to compare with the range of nucleoid occlusion mechanisms that occur.
Part of the MinD gradient covers the nucleoid area, since the fluorescence signal is still visible along the cell length, yet there is no sudden drop in fluorescence, suggesting that nucleoid exclusion does not play a role.
(16) Line 160: In addition to summarizing the membrane-binding affinity, descriptions of the differences in the gradient distribution or formation will be useful.
We have done this in lines 155-156 of the original manuscript: “The monomeric ATP binding G12V variant shows the same absence of a protein gradient as the K16A variant”.
(17) Line 262: 'distribution' is not shown.
We do not understand this remark. This information is shown in Fig. 5B (now Fig. 6B).
(18) Line 287: Wrong citation for reference 31.
Reference has been corrected.
(19) Line 288 and lines 596 regarding the Monte Carlo simulation:
(a) An illustration showing the reaction steps for MinD gradient formation will help understand the rationale and assumptions behind this simulation.
We have added an illustration depicting the different modelling steps in the new Fig. 8.
(b) Equations are missing.
(c) A table summarizing the parameters used in the simulation and their values.
(d) For general readers, it will be helpful to convert the simulation units to real units.
(e) Indicate real experimental data with a citation or the reason for any speculative value.
The Methods section provides a discussion of all parameters used in the potentials on which our kinetic Monte-Carlo algorithm is based. We have now also provided a Table in the SI (Table S1) with typical parameter values in both simulation units and real units. The experimental data and reasoning behind the values chosen are discussed in the Methods section (see “Kinetic Monte Carlo simulation”).
(20) Lines 320-321: Reference missing.
The interaction between MinJ and the dimer form of MinD is based on our findings shown in the original Fig. S4, and this information has not been published before. We have rephrased the sentence to make it more clear. Of note, Fig. S4 has been moved to the main manuscript, at the request of reviewer #2, and is now new Fig. 2.
(21) Lines 355-359: Is the statement specifically made for the Bs Min system? Is there any reference for the statement? Isn't the differences in diffusion rates between molecules 'at different locations' in the system more important than reducing their diffusion rates alone? It is unclear about the meaning of the statement "the Min system uses attachment to the membrane to slow down diffusion". Is this an assumption in the simulation?
The statement is generic; however, the reviewer has a good point and we have made this statement clearer by changing “considerably reduced diffusion rate” to “locally reduced diffusion rate” (line 359).
(22) Line 403: Citation format.
We have corrected the text and citation.
(23) Lines 442-444: The parameters are not defined anywhere in the manuscript.
Discussed in the M&M and in the new Table S1.
(24) Lines 464-465: Regarding the final sentence, what does 'this prediction' refer to? Hasn't the author started with experimental observations, predicted possible factors of membrane affinity and diffusion rates, and used the simulation approach to disprove or support the prediction?
We have changed “prediction” to “suggestion”, to make it clear that it is related to the suggestion in the previous sentence that “our modelling suggests that stimulation of MinD-dimerization at cell poles and cell division sites is needed.” (line 471).
(25) Materials and Methods: Statistical methods for data analyses are missing.
Added to “Microscopy” section.
(26) References: References 34, 40, 51 are incomplete.
References 34 and 40 have been corrected. Reference 51 is a book.
(27) Figures: The legends (Figures 1-7) can be shortened by removing redundant details in Material and Methods. Make sure statistical information is provided. The specific mutant MinD states, including Apo monomer, ATP-bound dimer, ATP hydrolysis deficient, and non-membrane binding etc can be specified in the main text. They are repeated in the legends of Figures 1 and 2.
We have removed redundant details from the legends and provided statistical information.
(28) Supporting information:
Table S1: Content of the acknowledgment statement may be moved to materials and methods and the acknowledgment section. Make sure statistical information is provided in the supporting figure legends.
We are not sure what the reviewer means with the content acknowledgement in Table S1 (now Table S2). Statistical information has been added.
Figure S1. Adding a MinD structure model will be useful.
We do not think that a structural model will enlighten our results since our work is not focused on structural mutagenesis. The mutants that we use have been described in other papers that we have cited.
Reviewer #2 (Recommendations for the authors):
The authors should cite and relate their data to the preprint by Feddersen & Bramkamp, BioRxiv 2024. ATPase activity of B. subtilis MinD is activated solely by membrane binding.
We have now discussed this paper in relation to our data in lines 407-409.
I am not convinced the authors are able to make the statement in lines 160-161 based on their assay: "This confirmed that the different monomeric mutants have the same membrane affinity as wild-type MinD". It is unclear if measuring valley-to-peak ratios in their longitudinal profiles can resolve small differences in membrane affinity. Wildtype MinD should at least be dimeric, or (as the authors also note elsewhere) may even be present in higher-order structures and as such have a higher membrane affinity than a monomeric MinD mutant. The authors should rephrase the corresponding sections in the manuscript to state that the MinD monomer already has detectable membrane affinity, instead of stating that the monomer and dimer membrane affinity are the same.
We agree that “the same affinity” is too strongly worded, and we have now rephrased this by saying that the different monomeric mutants have a membrane affinity comparable to that of wild-type MinD (line 152).
According to the authors' analysis, MinD-NS4B would not bind to the membrane as it has a valley-to-peak ratio higher than 1, similar to the soluble GFP. However, the protein is clearly forming a gradient, and as such probably binding to the membrane. The authors should discuss this as a limitation of their membrane binding measure.
The ratio value of 1 is not a cutoff for membrane binding. As shown in Fig. 1F, GFP has a valley-to-peak ratio close to 1.25, whereas the FM5-95 membrane dye has a ratio close to 0.75. In Fig. 3C (now Fig. 4C) we have shown that GFP fused with the NS4B membrane anchor has a lower ratio than free GFP, and we have shown the same in Fig. 4D (now Fig. 5D) for GFP-MinD-NS4B. The differences are small but clear, and the ratios are not similar to that of free GFP.
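To make the measure discussed above concrete, here is a minimal sketch of one possible way to compute such a ratio. The definition used here, intensity at the cell centre divided by the mean intensity at the two membrane positions of a transverse profile, is an assumption, as are the function name and the toy profiles; the exact procedure used in the manuscript is not given in this response.

```python
def valley_to_peak_ratio(profile, left_edge, right_edge):
    """Centre intensity divided by the mean intensity at two edge positions.

    profile     -- fluorescence intensities across the cell width
    left_edge   -- index of the assumed left membrane position
    right_edge  -- index of the assumed right membrane position
    """
    if not 0 <= left_edge < right_edge < len(profile):
        raise ValueError("edge indices must lie inside the profile, left < right")
    centre = profile[(left_edge + right_edge) // 2]
    edges = (profile[left_edge] + profile[right_edge]) / 2
    return centre / edges


# Toy profiles with invented numbers, for illustration only.
membrane_like = [1, 4, 3, 4, 1]   # brighter at the edges than in the centre
cytosolic_like = [1, 4, 5, 4, 1]  # brightest in the centre

ratio_membrane = valley_to_peak_ratio(membrane_like, 1, 3)
ratio_cytosolic = valley_to_peak_ratio(cytosolic_like, 1, 3)
```

Under this assumed definition, a membrane-enriched signal gives a ratio below 1 and a cytosolic signal gives a ratio above 1, so intermediate values reflect partial membrane association rather than a binary cutoff at 1.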
The observation that MinD dimers are localized by MinJ is interesting and key to the rule of the Monte Carlo simulation that dimers attach to MinJ. However, the data is hidden in the supplementary information and is not analysed as comprehensively, e.g., it lacks the analysis of the membrane binding. The paper would benefit from moving the fluorescence images and accompanying analysis into the main text.
We have moved this figure to the main text and added an analysis of the fluorescence intensities (new Fig. 2).
The authors should show the data for cell length and minicell formation, not only for the MinD-amphipathic helix versions (Fig. 5), but also for the GFP-MinD, and all the MinD mutants. They do refer to some of this data in lines 145-148 but do not show it anywhere. They also refer to "did not result in cell filamentation" in line 213 and to "resulted in highly filamentous cells" and "Introduction of a minC deletion restored cell division" in lines 167-160 without showing the cell length and minicell data, but instead refer to the fluorescence image of the respective strain. I would suggest the authors include this data either in a subpanel in the respective figure or in the supplementary information.
The effect of uncontrolled MinC activity is very apparent and leads to long filamentous cells. The occurrence of minicells is also apparent. The cell length distribution of wild type cells is shown in Fig. 6B, and minicell formation is negligibly small in wild type cells.
The transverse fluorescence intensity profiles used as a measure for membrane binding are an average profile from ~30 cells. In the case of the longitudinal profiles that display the gradient, only individual profiles are displayed. I understand that because of distinct cell length, the longitudinal profiles cannot simply be averaged. However, it is possible to project the profiles onto a unit length for averaging (see for example the projection of profiles in McNamara et al., BioRxiv (2023)). It would be more convincing to average these profiles, which would allow the authors to also quantify the gradients in more detail. If that is impossible, the authors may at least quantify individual valley-to-peak ratios of the longitudinal fluorescence profiles as a measure of the gradient.
We agree that in future work it would be better to average the profiles as suggested. However, due to limited time and resources, we cannot do this for the current manuscript.
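For reference, the projection onto unit length suggested by the reviewer can be sketched as follows. This is a generic illustration using linear interpolation, not the procedure of McNamara et al. or of the authors; the function names and the number of output points are assumptions.

```python
def resample_to_unit_length(profile, n_points):
    """Linearly interpolate an evenly sampled profile onto n_points spanning 0..1."""
    if len(profile) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(profile) - 1
    out = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)  # fractional index into the original profile
        i = min(int(pos), last - 1)      # left neighbour, clamped at the final interval
        frac = pos - i
        out.append(profile[i] * (1 - frac) + profile[i + 1] * frac)
    return out


def average_profiles(profiles, n_points=50):
    """Rescale profiles of different lengths to unit length, then average pointwise."""
    resampled = [resample_to_unit_length(p, n_points) for p in profiles]
    return [sum(column) / len(resampled) for column in zip(*resampled)]
```

Each longitudinal profile is mapped onto the same relative cell coordinate (0 at one pole, 1 at the other) before averaging, so cells of different lengths contribute equally at each relative position.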
Regarding the rules and parameters used for the Monte Carlo simulation (see also the corresponding sections in the public review):
(1) The authors mention that they have not included multimerization of MinD in their simulation but argue in the discussion that it would only strengthen the differences in the diffusion between monomers and multimers. This is correct, but it may also change the membrane residence time and membrane affinity drastically.
Simulation of multimerization is difficult, but we have now included a simulation whereby MinD dimers can also form tetramers (lines 341-348), shown in the new Fig. 8K. This did not alter the MinD gradient much.
(2) The authors implement a dimer-to-monomer transition rate that they equate with the stochastic ATP hydrolysis rate occurring with a half-life of approximately 1/s (line 305). They claim that this rate is based on information from E. coli and cite Huang and Wingreen. However, the Huang paper only mentions the nucleotide exchange rate from ADP to ATP at 1/s. Later that paper cites their use of an ATP hydrolysis rate of 0.7/s to match the E. coli MinDE oscillation rate of 40s. From the authors' statement, it is unclear to me whether they refer to the actual ATP hydrolysis rate in Huang and Wingreen or something else. For E. coli MinD, both the membrane and MinE stimulate ATPase activity. Even if B. subtilis lacks MinE, ATP hydrolysis may still be stimulated by the membrane, which has also been reported in another preprint (Feddersen & Bramkamp, BioRxiv 2024). It may also be stimulated by other components of the Min system like MinJ. The authors should include in the manuscript the Monte Carlo simulation implementing dimer to monomer transition on the membrane only, which is currently referred to only as "(data not shown)".
The exact value of the ATP hydrolysis rate is not so important here, so 1/s only gives the order of magnitude (in line with 0.7/s above), which we have now clarified in lines 631-632. We have now also added the “(data not shown)” results to Fig. 8, i.e. simulations where dimer to monomer transitions (i.e. ATPase activity) only occur at the membrane (Fig. 8D & E, and lines 319-322).
(3) How long did the authors simulate for? How many steps? What timesteps does the average pictured in Figure 7 correspond to?
We simulated 10^7 time steps (corresponding to 100 s in real time). We have checked that the simulation steps for which we average are in steady state. Typical snapshots are recorded after 10^6 to 10^7 time steps, when the system is in steady state. We have added this information in lines 299-300.
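The numbers quoted above imply a time step of 10 µs (100 s divided by 10^7 steps). The short sketch below works through that arithmetic and relates a first-order rate to a half-life and to a per-step transition probability. Treating the dimer-to-monomer transition as a first-order process with a per-step probability of 1 - exp(-k*dt) is an assumption made for illustration; the response does not state how the authors implement it.

```python
import math

N_STEPS = 10**7       # number of simulated time steps (from the response)
TOTAL_TIME_S = 100.0  # corresponding real time in seconds (from the response)

dt = TOTAL_TIME_S / N_STEPS  # duration of one time step in seconds


def per_step_probability(rate_per_s, dt_s):
    """Probability that a first-order event with the given rate occurs within one step."""
    return -math.expm1(-rate_per_s * dt_s)  # equals 1 - exp(-k*dt), numerically stable


def half_life(rate_per_s):
    """Half-life in seconds of a first-order process with the given rate."""
    return math.log(2) / rate_per_s
```

With these definitions, a rate of order 1/s corresponds to a half-life of ln(2)/k, roughly 0.7 s, and to a per-step probability of about 10^-5. A rate of 0.7/s gives a half-life close to 1 s, which is why the two values discussed by the reviewer are of the same order of magnitude.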
There are several misconceptions about the (oscillating E. coli) Min system in the main text:
(1) Lines 77-78: "In case of the E. coli MinD, ATP binding leads to dimerization of MinD, which induces a conformational change in the C-terminal region, thereby exposing an amphipathic helix that functions as a membrane binding domain" and "This shows a clear difference with the E. coli situation, where dimerization of MinD causes a conformational change of the C-terminal region enabling the amphipathic helix to insert into the lipid bilayer" in lines 400-403 are incorrect. There is no evidence that the amphipathic helix at the C-terminus of MinD changes conformation upon ATP binding; several studies have shown instead that a single copy of the amphipathic helix is too weak to confer efficient membrane binding but that the dimerization confers increased membrane binding as now two amphipathic helices are present leading to an avidity effect in membrane binding. Please refer to the following papers (Szeto et al., JBC (2003); Wu et al., Mol Microbiol (2011); Park et al., Cell (2011); Heermann et al., JMB (2020); Loose et al., Nat Struct Mol Biol (2011); Kretschmer et al., ACS Syn Biol (2021); Ramm et al., Nat Commun (2018)) or, for a better overview, the following reviews on the topic of the E. coli Min system: Wettmann and Kruse, Philos Trans R Soc B Biol (2018); Ramm et al., Cell and Mol Life Sci (2019); Halatek et al., Philos Trans R Soc B Biol Sci (2018).
This is indeed incorrectly formulated, and we have now amended this in lines 64-66 and lines 403-406. Key papers are cited in the text.
(2) The authors mention that E. coli MinD may multimerize, citing a study where purified MinD was found to polymerize, and then suggest that this is unlikely to be the case in B. subtilis as FRAP recovery of MinD is quick. However, cooperativity in membrane binding is essential to the mathematical models reproducing E. coli Min oscillations, and there is more recent experimental evidence that E. coli MinD forms smaller oligomers that differ in their membrane residence time and diffusion (e.g., Heermann et al., Nat Methods (2023); Heermann et al., JMB (2020)). I would suggest the authors revise the corresponding text sections and test the multimerization in their simulation (see above).
As mentioned above, simulating oligomerization is difficult, but in order to approximate related cooperative effects, we have simulated a situation whereby MinD dimers can form tetramers. This simulation did not show a large change in MinD gradient formation. We have added the result of this simulation to Fig. 8 (Fig. 8K), and discuss this further in lines 341-348 and 459-467.
(3) Lines 75-76 and lines 79-80: The sentences "MinC ... and needs to bind to the Walker A-type ATPase MinD for its activity" and "The MinD dimer recruits MinC ... and stimulates its activity" are misleading. MinC is localized by MinD, but MinD does not alter MinC activity, as MinC mislocalization or overexpression also prevents FtsZ ring formation leading to minicell or filamentous cells, as also later described by the authors (line 98). There is also no biochemical evidence that the presence of MinD somehow alters MinC activity towards FtsZ other than a local enrichment on the membrane. I would rephrase the sentence to emphasize that MinD is only localizing MinC but does not alter its activity.
We have rephrased this sentence to prevent misinterpretation (lines 66-67).
Minor points:
(1) I am not quite sure what the experiment with the CCCP shows. The authors explain that MinD binding via the amphipathic helix requires the presence of membrane potential and that the addition of CCCP disturbs binding. They then show that the MinD with two amphipathic helices is not affected by CCCP but the wildtype MinD is. What is the conclusion of this experiment? Would that mean that the MinD with two amphipathic helices binds more strongly, very differently, perhaps non-physiologically?
This experiment was “To confirm that the tandem amphipathic helix increased the membrane affinity of MinD”, as mentioned in the beginning of the paragraph (line 224).
(2) Lines 456-457: Please cite the FRAP experiment that shows a quick recovery rate of MinD.
Reference has been added.
(3) Figure 4D: It is unclear to me to which condition the p-value brackets point.
This is related to a statistical t-test. We have added this information to the legend of the figure.
(4) Line 111, "in the membrane affinity of the MinD". I think that the "the" before MinD should be removed.
Corrected
(5) Typo in line 199 "indicting" instead of indicating.
Corrected
(6) Typo in line 220 "reversable" instead of reversible.
Corrected
(7) Lines 279, 284, 905: "Monte-Carlo" should read Monte Carlo.
Corrected
Reviewer #3 (Recommendations for the authors):
Introduction: As written, the introduction does not provide sufficient background for the uninitiated reader to understand the function of the MinCD complex in the context of assembly and activation of cell division in B. subtilis. The introduction is also quite long and would benefit from condensing the description of the Min oscillation mechanism in E. coli to one or two sentences. While highlighting the role of MinE in this system is important for understanding how it works, it is only needed as a counterpoint to the situation in B. subtilis.
Since the Min system of E. coli is by far the best understood Min system, we feel that it is important to provide detailed information on this system. However, we have added an introductory sentence to explain the key function of the Min system (lines 46-48).
Line 248: Increasing MinD membrane affinity increases the frequency of minicells - however it is unclear if cells are dividing too much or if it is just a Min mutant (i.e. occasionally dividing at the cell pole vs the middle)? Cell length measurements should be included to clarify this point (Figures 4 and 5).
This information is presented in Fig. 5B (Cell length distribution), which is now Fig. 6B, indicating that the average cell length increases in the tandem alpha helix mutant, a phenotype that is comparable to a MinD knockout.
Figure 5: I am a bit confused as to whether increasing MinD affinity doesn't lead to a general block in division by MinCD rather than phenocopying a minD null mutant.
Although the tandem alpha helix mutant has a cell length distribution comparable to a minD knockout, the tandem mutant produces far fewer minicells than the minD knockout, indicating that there is still some cell division regulation.
eLife Assessment
This study, from the group that pioneered migrasome, describes a novel vaccine platform of engineered migrasomes that behave like natural migrasomes. Importantly, this platform has the potential to overcome obstacles associated with cold chain issues for vaccines such as mRNA. In the revised version, the authors have addressed previous concerns and the results from additional experiments provide compelling evidence that features methods, data, and analyses more rigorous than the current state-of-the-art. Although the findings are important with practical implications for the vaccine technology, results from additional experiments would make this an outstanding study.
Reviewer #1 (Public review):
Summary:
Outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.
Strengths:
Innovative approach at several levels: Migrasomes, discovered by Dr. Yu's group, are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.
Weaknesses:
I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.
Comments on revisions: This reviewer feels that the authors have addressed all issues.
Reviewer #2 (Public review):
Summary:
The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle for using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that either expressed a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a 2nd vaccination and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.
Strengths:
The authors present very robust studies detailing the biology behind migrasome formation and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done, including thermal stability and characterization of the particle size (important characterizations for a good vaccine).
Weaknesses:
With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome based vaccine could elicit responses comparable to a proven vaccine technology. Additionally, understanding the integrity of the antigens expressed in their migrasomes could be useful. This could be done by looking at functional monoclonal antibody binding to their migrasomes in a confocal microscopy experiment.
Updates after revision:
The revised manuscript has additional experiments that I believe improve the strength of evidence presented in the manuscript and address the weaknesses of the first draft. First, they provide a comparison of the antibody responses induced by their migrasome-based platform with those induced by recombinant protein formulated in an adjuvant, and show that the responses are comparable. Second, they provide evidence that the spike protein incorporated into their migrasomes retains structural integrity by preserving binding to monoclonal antibodies. Together, these results strengthen the paper significantly and support the claims that the novel migrasome-based vaccine platform could be useful in the vaccine development field.
Author response:
The following is the authors’ response to the original reviews
Public Reviews:
Reviewer #1 (Public Review):
Summary:
This is an excellent study by a superb investigator who discovered and is championing the field of migrasomes. This study contains a hidden "gem" - the induction of migrasomes by hypotonicity and how that happens. In summary, an outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.
Strengths:
Innovative approach at several levels. Migrasomes - discovered by Dr Yu's group - are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.
Weaknesses:
I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.
We sincerely thank the reviewer for the encouraging and insightful comments. We fully agree that the fundamental aspects of migrasome biology are of great importance and deserve deeper exploration.
In line with the reviewer’s suggestion, we have expanded our discussion on the basic biology of engineered migrasomes (eMigs). A recent study by the Okochi group at the Tokyo Institute of Technology demonstrated that hypoosmotic stress induces the formation of migrasome-like vesicles, involving cytoplasmic influx and requiring cholesterol for their formation (DOI: 10.1002/1873-3468.14816, February 2024). Building on this, our study provides a detailed characterization of hypoosmotic stress-induced eMig formation, and further compares the biophysical properties of natural migrasomes and eMigs. Notably, the inherent stability of eMigs makes them particularly promising as a vaccine platform.
Finally, we would like to note that our laboratory continues to investigate multiple aspects of migrasome biology. In collaboration with our colleagues, we recently completed a study elucidating the mechanical forces involved in migrasome formation (DOI: 10.1016/j.bpj.2024.12.029), which further complements the findings presented here.
Reviewer #2 (Public review):
Summary:
The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle in using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that either expressed a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a 2nd vaccination and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.
Strengths:
The authors present very robust studies detailing the biology behind migrasome formation and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done including thermal stability and characterization of the particle size (important characterizations for a good vaccine).
Weaknesses:
With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome-based vaccine could elicit responses comparable to a proven vaccine technology.
We thank the reviewer for the thoughtful evaluation and constructive suggestions, which have helped us strengthen the manuscript.
Comparison with proven vaccine technologies:
In response to the reviewer’s comment, we now include a direct comparison of the antibody responses elicited by eMig-Spike and a conventional recombinant S1 protein vaccine formulated with Alum. As shown in the revised manuscript (Author response image 1), the levels of S1-specific IgG induced by the eMig-based platform were comparable to those induced by the S1+Alum formulation. This comparison supports the potential of eMigs as a competitive alternative to established vaccine platforms.
Author response image 1.
eMigrasome-based vaccination showed similar efficacy compared with adjuvanted recombinant spike protein. The amount of S1-specific IgG in mouse serum was quantified by ELISA on day 14 after immunization. Mice were either intraperitoneally (i.p.) immunized with recombinant Alum/S1 or intravenously (i.v.) immunized with eM-NC, eM-S or recombinant S1. The administered doses were 20 µg/mouse for eMigrasomes, 10 µg/mouse (i.v.) or 50 µg/mouse (i.p.) for recombinant S1 and 50 µl/mouse for Aluminium adjuvant.
Assessment of antigen integrity on migrasomes:
To address the reviewer’s suggestion regarding antigen integrity, we performed immunoblotting using antibodies against both S1 and mCherry. Two distinct bands were observed: one at the expected molecular weight of the S-mCherry fusion protein, and a higher molecular weight band that may represent oligomerized or higher-order forms of the Spike protein (Figure 5b in the revised manuscript).
Furthermore, we performed confocal microscopy using a monoclonal antibody against Spike (anti-S). Co-localization analysis revealed strong overlap between the mCherry fluorescence and anti-Spike staining, confirming the proper presentation and surface localization of intact S-mCherry fusion protein on eMigs (Figure 5c in the revised manuscript). These results confirm the structural integrity and antigenic fidelity of the Spike protein expressed on eMigs.
Recommendations for the authors
Reviewer #1 (Recommendations For The Authors):
I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.
I know that the reviewers always ask for more, and this is not the case here. Can the abstract and title be changed to emphasize the science behind migrasome formation, and possibly add a few more fundamental aspects on how hypotonic shock induces migrasomes?
Alternatively, if the authors desire to maintain the emphasis on vaccines, can immunological mechanisms be somewhat expanded in order to - at least to some extent - explain why migrasomes are a better vaccine vehicle?
One way or another, this reviewer is highly supportive of this study and it is really up to the authors and the editor to decide whether my comments are of use or not.
My recommendation is to go ahead with publishing after some adjustments as per above.
We’d like to thank the reviewer for the suggestion. We have changed the title of the manuscript and modified the abstract, emphasizing the fundamental science behind the development of eMigrasome. To gain some immunological information on eMig-elicited antibody responses, we characterized the type of IgG induced by eM-OVA in mice, and compared it to that induced by Alum/OVA. The IgG response to Alum/OVA was dominated by IgG1. In contrast, eM-OVA induced an even distribution of IgG subtypes, including IgG1, IgG2b, IgG2c, and IgG3 (Figure 4i in the revised manuscript). The ratio between IgG1 and IgG2a/c indicates a Th1 or Th2 type humoral immune response. Thus, eM-OVA immunization induces a balance of Th1/Th2 immune responses.
Reviewer #2 (Recommendations For The Authors):
The study is a very nice exploration of a new vaccine platform. This reviewer believes that a more head-to-head comparison to the current SARS-CoV-2 vaccine platform would improve the manuscript. This comparison is done with OVA antigen, but this model antigen is not as exciting as a functional head-to-head with a SARS-CoV-2 vaccine.
I think that two other discussion points should be included in the manuscript. First, was the host-cell protein evaluated? If not, I would include that point on how issues of host cell contamination of the migrasome could play a role in the responses and safety of a vaccine. Second, I would discuss antigen incorporation and localization into the platform. For example, the full-length spike being expressed has a native signal peptide and transmembrane domain. The authors point out that a transmembrane domain can be added to display an antigen that does not have one natively expressed, however, without a signal peptide this would not be secreted and localized properly. I would suggest adding a discussion of how a non-native signal peptide would be necessary in addition to a transmembrane domain.
We thank the reviewer for these thoughtful suggestions and fully agree that the points raised are important for the translational development of eMig-based vaccines.
(1) Host cell proteins and potential immunogenicity:
We appreciate the reviewer’s suggestion to consider host cell protein contamination. Considering potential clinical application of eMigrasomes in the future, we will use human cells with low immunogenicity such as HEK-293 or embryonic stem cells (ESCs) to generate eMigrasomes. Also, we will follow a QC that meets the standard of validated EV-based vaccination techniques.
(2) Antigen incorporation and localization—signal peptide and transmembrane domain:
We also agree with the reviewer’s point that proper surface display of antigens on eMigs requires both a transmembrane domain and a signal peptide for correct trafficking and membrane anchoring. For instance, in the case of full-length Spike protein, the native signal peptide and transmembrane domain ensure proper localization to the plasma membrane and subsequent incorporation into eMigs. In the case of OVA, a secretory protein that contains a native signal peptide yet lacks a transmembrane domain, an engineered transmembrane domain is required. For antigens that do not naturally contain these features, both a non-native signal peptide and an artificial transmembrane domain are necessary. We have clarified this point in the revised discussion and explicitly noted the requirement for a signal peptide when engineering antigens for surface display on migrasomes.
eLife Assessment
This paper reports the fundamental finding of how Raman spectral patterns correlate with proteome profiles using Raman spectra of E. coli cells from different physiological conditions and found global stoichiometric regulation on proteomes. The authors' findings provide compelling evidence that stoichiometric regulation of proteomes is general through analysis of both bacterial and human cells. In the future, similar methodology can be applied on various tissue types and microbial species for studying proteome composition with Raman spectral patterns.
Reviewer #1 (Public review):
Summary
This work performed Raman spectral microscopy for E. coli cells with 15 different culture conditions. The author developed a theoretical framework to construct a regression matrix which predicts proteome composition by Raman data. Specifically, this regression matrix is obtained by statistical inference from various experimental conditions. With this model, the authors categorized co-expressed genes and illustrate how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase dependent genes. Overall, the author demonstrates a strong and comprehensive data analysis scheme for the joint analysis of Raman and proteome datasets.
Strengths and major contributions
Major contributions: (1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions. (2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.
Discussion and impact for the field
Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. This work is a strong initiative for introducing the powerful technique to systems biology and provides a rigorous pipeline for future data analysis. The regression matrix can be used for cross-comparison among future experimental results on proteome-Raman datasets.
Comments on revisions:
The authors addressed all my questions nicely. In particular, the subsampling test demonstrated that with enough "distinct" physiological conditions (even for m=5) one could already explore the major mode of proteome regulation and Raman signature. The main text has been streamlined and the clarity is improved. I have a minor suggestion:
(i) For equation (1), it is important to emphasize that the formula works for every j=1,...,15, and the regression matrix B is obtained by statistical inference by summarizing data from all 15 conditions.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
Summary
This work performed Raman spectral microscopy at the single-cell level for 15 different culture conditions in E. coli. The Raman signature is systematically analyzed and compared with the proteome dataset of the same culture conditions. With a linear model, the authors revealed correspondence between Raman pattern and proteome expression stoichiometry indicating that spectrometry could be used for inferring proteome composition in the future. With both Raman spectra and proteome datasets, the authors categorized co-expressed genes and illustrated how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase-dependent genes. Overall, the authors demonstrate a strong and solid data analysis scheme for the joint analysis of Raman and proteome datasets.
Strengths and major contributions
(1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions.
(2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.
Weaknesses
The experimental measurements of Raman microscopy were conducted at the single-cell level; however, the analysis was performed by averaging across the cells. The author did not discuss if Raman microscopy can be used to detect cell-to-cell variability under the same condition.
We thank the reviewer for raising this important point. Though this topic is beyond the scope of our study, some of our authors have addressed the application of single-cell Raman spectroscopy to characterizing phenotypic heterogeneity in individual Staphylococcus aureus cells in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718). Additionally, one of our authors demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, detecting cell-to-cell variability under the same conditions has been shown to be feasible. Whether averaging single-cell Raman spectra is necessary depends on the type of analysis and the available dataset. We will discuss this in more detail in our response to Comment (1) by Reviewer #1 (Recommendation for the authors).
Discussion and impact on the field
Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. It has the advantage that single-cell level data could be acquired and both in vivo and in vitro data can be compared. This work is a strong initiative for introducing the powerful technique to systems biology and providing a rigorous pipeline for future data analysis.
Reviewer #2 (Public review):
Summary and strengths:
Kamei et al. observe the Raman spectra of a population of single E. coli cells in diverse growth conditions. Using LDA, Raman spectra for the different growth conditions are separated. Using previously available protein abundance data for these conditions, a linear mapping from Raman spectra in LDA space to protein abundance is derived. Notably, this linear map is condition-independent and is consequently shown to be predictive for held-out growth conditions. This is a significant result and in my understanding extends the earlier Raman to RNA connection that has been reported earlier.
They further show that this linear map reveals something akin to bacterial growth laws (à la Scott/Hwa): a certain collection of proteins shows stoichiometric conservation, i.e. the group (called SCG, stoichiometrically conserved group) maintains its stoichiometry across conditions while the overall scale depends on the conditions. Analyzing the changes in protein mass and Raman spectra under these conditions, they find that the abundance ratios of information processing proteins (one of the large groups where many proteins belong to "information and storage" - ISP, which is also identified as a cluster of orthologous proteins) remain constant. The mass of these proteins, deemed the homeostatic core, increases linearly with growth rate. Other SCGs and other proteins are condition-specific.
Notably, beyond the ISP COG the other SCGs were identified directly using the proteome data. Taking the analysis further, they then show how the centrality of a protein - roughly measured as how many proteins it is stoichiometric with - relates to function and evolutionary conservation. Again significant results, but I am not sure if these ideas have been reported earlier, for example from the community that built protein-protein interaction maps.
As pointed out, past studies have revealed that the function, essentiality, and evolutionary conservation of genes are linked to the topology of gene networks, including protein-protein interaction networks. However, to the best of our knowledge, their linkage to stoichiometry conservation centrality of each gene has not yet been established.
Previously analyzed networks, such as protein-protein interaction networks, depend on known interactions. Therefore, as our understanding of the molecular interactions evolves with new findings, the conclusions may change. Furthermore, analysis of a particular interaction network cannot account for effects from different types of interactions or multilayered regulations affecting each protein species.
In contrast, the stoichiometry conservation network in this study focuses solely on expression patterns as the net result of interactions and regulations among all types of molecules in cells. Consequently, the stoichiometry conservation networks are not affected by the detailed knowledge of molecular interactions and naturally reflect the global effects of multilayered interactions. Additionally, stoichiometry conservation networks can easily be obtained for non-model organisms, for which detailed molecular interaction information is usually unavailable. Therefore, analysis with the stoichiometry conservation network has several advantages over existing methods from both biological and technical perspectives.
We added a paragraph explaining this important point to the Discussion section, along with additional literature.
Finally, the paper built a lot of "machinery" to connect Ω<sub>LE</sub>, built directly from proteome, and Ω<sub>B</sub>, built from Raman, spaces. I am unsure how that helps and have not been able to digest the 50 or so pages devoted to this.
The mathematical analyses in the supplementary materials form the basis of the argument in the main text. Without the rigorous mathematical discussions, Fig. 6E — one of the main conclusions of this study — and Fig. 7 could never be obtained. Therefore, we believe the analyses are essential to this study. However, we clarified why each analysis is necessary and significant in the corresponding sections of the Results to improve the manuscript's readability.
Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).
Strengths:
The rigorous analysis of the data is the real strength of the paper. Alongside this, the discovery of SCGs that are condition-independent and that are condition-dependent provides a great framework.
Weaknesses:
Overall, I think it is an exciting advance but some work is needed to present the work in a more accessible way.
We edited the main text to make it more accessible to a broader audience. Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).
Reviewer #1 (Recommendations for the authors):
(1) The Raman spectral data is measured from single-cell imaging. In the current work, most of the conclusions are from averaged data. From my understanding, once the correspondence between LDA and proteome data is established (i.e. the matrix B) one could infer the single-cell proteome composition from B. This would provide valuable information on how proteome composition fluctuates at the single-cell level.
We can calculate single-cell proteomes from single-cell Raman spectra in the manner suggested by the reviewer. However, we cannot evaluate the accuracy of their estimation without single-cell proteome data under the same environmental conditions. Likewise, we cannot verify variations of estimated proteomes of single cells. Since quantitatively accurate single-cell proteome data is unavailable, we concluded that addressing this issue was beyond the scope of this study.
Nevertheless, we agree with the reviewer that investigating how proteome composition fluctuates at the single-cell level based on single-cell Raman spectra is an intriguing direction for future research. In this regard, some of our authors have studied the phenotypic heterogeneity of Staphylococcus aureus cells using single-cell Raman spectra in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718), and one of our authors has demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, it is highly plausible that single-cell Raman spectroscopy can also characterize proteomic fluctuations in single cells. We have added a paragraph to the Discussion section to highlight this important point.
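As an illustration only, the calculation discussed in this response can be sketched as follows. This is not the study's analysis code: the data are synthetic, the array sizes are arbitrary, and both the constant term appended to the LDA coordinates and the plain least-squares fit are assumptions made for the sketch. A matrix B is fitted on condition-averaged data and then applied to single-cell LDA coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n_proteins = 15, 200   # conditions and proteins (synthetic sizes)
d = m - 1                 # the LDA space has at most m - 1 dimensions

# Synthetic stand-ins for the real data (assumptions, not the study's data).
B_true = rng.normal(size=(n_proteins, d + 1))
R_mean = rng.normal(size=(m, d))               # condition-averaged LDA coordinates
R_aug = np.hstack([R_mean, np.ones((m, 1))])   # append a constant term (assumed)
P = R_aug @ B_true.T                           # condition-level proteomes, m x n_proteins

# Fit B by least squares using all m conditions at once.
B_hat = np.linalg.lstsq(R_aug, P, rcond=None)[0].T   # n_proteins x (d + 1)

def estimate_single_cell_proteomes(cells_lda, B):
    """Apply the condition-level map B to single-cell LDA coordinates."""
    ones = np.ones((cells_lda.shape[0], 1))
    return np.hstack([cells_lda, ones]) @ B.T

# Hypothetical single cells scattered around the first condition's mean.
cells = R_mean[0] + 0.1 * rng.normal(size=(50, d))
P_cells = estimate_single_cell_proteomes(cells, B_hat)
print(P_cells.shape)
```

The sketch yields one estimated proteome per cell, but, as noted above, nothing in it can check those estimates: that would require measured single-cell proteomes under the same conditions.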
(2) The establishment of matrix B is quite confusing for readers who only read the main text. I suggest adding a flow chart in Figure 1 to explain the data analysis pipeline, as well as state explicitly what is the dimension of B, LDA matrix, and proteome matrix.
We thank the reviewer for the suggestion. Following the reviewer's advice, we have explicitly stated the dimensions of the vectors and matrices in the main text. We have also added descriptions of the dimensions of the constructed spaces. Rather than adding another flow chart to Figure 1, we added a new table (Table 1) to explain the various symbols representing vectors and matrices, thereby improving the accessibility of the explanation.
(3) One of the main contributions for this work is to demonstrate how proteome stoichiometry is regulated across different conditions. A total of m=15 conditions were tested in this study, and this limits the rank of LDA matrix as 14. Therefore, maximally 14 "modes" of differential composition in a proteome can be detected.
As a general reader, I am wondering in the future if one increases or decreases the number of conditions (say m=5 or m=50) what information can be extracted? It is conceivable that increasing different conditions with distinct cellular physiology would be beneficial to "explore" different modes of regulation for cells. As proof of principle, I am wondering if the authors could test a lower number (by sub-sampling from m=15 conditions, e.g. picking five of the most distinct conditions) and see how this would affect the prediction of proteome stoichiometry inference.
We thank the reviewer for bringing an important point to our attention. To address the issue raised, we conducted a new subsampling analysis (Fig. S14).
As we described in the main text (Fig. 6E) and the supplementary materials, the m x m orthogonal matrix, Θ, represents to what extent the two spaces Ω<sub>LE</sub> and Ω<sub>B</sub> are similar (m is the number of conditions; in our main analysis, m = 15). Thus, the low-dimensional correspondence between the two spaces connected by an orthogonal transformation, such as an m-dimensional rotation, can be evaluated by examining the elements of the matrix Θ. Specifically, large off-diagonal elements of the matrix mix higher dimensions and lower dimensions, making the two spaces spanned by the first few major axes appear dissimilar. Based on this property, we evaluated the vulnerability of the low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> to the reduced number of conditions by measuring how close Θ was to the identity matrix when the analysis was performed on the subsampled datasets.
In the new figure (Fig. S14), we first created all possible smaller condition sets by subsampling the conditions. Next, to evaluate the closeness between the matrix Θ and the identity matrix for each smaller condition set, we generated 10,000 random orthogonal matrices of the same size as Θ. We then evaluated the probability of obtaining a higher level of low-dimensional correspondence than that of the experimental data by chance (see section 1.8 of the Supplementary Materials). This analysis was already performed in the original manuscript for the non-subsampled case (m = 15) in Fig. S9C; the new analysis systematically evaluates the correspondence for the subsampled datasets.
The results clearly show that low-dimensional correspondence is more likely to be obtained with more conditions (Fig. S14). In particular, when the number of conditions used in the analysis exceeds five, the median of the probability that random orthogonal matrices were closer to the identity matrix than the matrix Θ calculated from subsampled experimental data became lower than 10<sup>-4</sup>. This analysis provides insight into the number of conditions required to find low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub>.
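The chance-level comparison described above can be sketched in a few lines. The closeness score used here (mean squared diagonal element over the first k axes) is a hypothetical stand-in; the measure actually used is defined in section 1.8 of the Supplementary Materials and is not reproduced here. Random orthogonal matrices are drawn by QR decomposition of Gaussian matrices with a sign correction, a standard way to sample uniformly from the orthogonal group.

```python
import numpy as np

def random_orthogonal(m, rng):
    """Draw an m x m orthogonal matrix uniformly (Haar measure) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(m, m)))
    return q * np.sign(np.diag(r))   # fix column signs so the draw is uniform

def closeness_to_identity(theta, k):
    """Hypothetical score: mean squared diagonal element over the first k axes."""
    return float(np.mean(np.diag(theta)[:k] ** 2))

def chance_probability(theta, k=3, n_random=10_000, seed=0):
    """Fraction of random orthogonal matrices scoring at least as high as theta."""
    rng = np.random.default_rng(seed)
    observed = closeness_to_identity(theta, k)
    m = theta.shape[0]
    hits = sum(
        closeness_to_identity(random_orthogonal(m, rng), k) >= observed
        for _ in range(n_random)
    )
    return hits / n_random

# Usage with a toy matrix: a perfect correspondence (identity) should
# essentially never be matched by a random orthogonal matrix.
print(chance_probability(np.eye(6), k=3, n_random=1000))
```

Repeating this for every subsampled condition set and taking the median of the returned probabilities would give a summary comparable in spirit to the one reported for Fig. S14, under the assumed score.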
Which conditions are used in the analysis can change the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>. Therefore, it is important to clarify whether including more conditions in the analysis reduces the dependence of the low-dimensional structures on conditions. We leave this issue as a subject for future study. This issue relates to the effective dimensionality of omics profiles needed to establish the diverse physiological states of cells across conditions. Determining the minimum number of conditions to attain the condition-independent low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub> would provide insight into this fundamental problem. Furthermore, such an analysis would identify the range of applications of Raman spectra as a tool for capturing macroscopic properties of cells at the system level.
We now discuss this point in the Discussion section, referring to this analysis result (Fig. S14). Please also see our reply to the comment (1) by Reviewer #2 (Recommendations for the authors).
(4) In E. coli cells, total proteome is in mM concentration while the total metabolites are between 10 to 100 mM concentration. Since proteins are large molecules with more functional groups, they may contribute more Raman signal (per molecule) than metabolites. Still, the meaningful quantity here is the "differential Raman signal" with different conditions, not the absolute signal. I am wondering what percentage of the differential Raman signature comes from the proteome and what percentage from the metabolome.
It is an important and interesting question to what extent changes in the proteome and metabolome contribute to changes in Raman spectra. Though we concluded that answering this question is beyond the scope of this study, we believe it is an important topic for future research.
Raman spectral patterns convey the comprehensive molecular composition spanning the various omics layers of target cells. Changes in the composition of these layers can be highly correlated, and identifying their contributions to changes in Raman spectra would provide insight into the mutual correlation of different omics layers. Addressing the issue raised by the reviewer would expand the applications of Raman spectroscopy and highlight the advantage of cellular Raman spectra as a means of capturing comprehensive multi-omics information.
We note that some studies have evaluated the contributions of proteins, lipids, nucleic acids, and glycogen to the Raman spectra of mammalian cells and how these contributions change in different states (e.g., Mourant et al., J Biomed Opt, 10(3), 031106, 2005). Additionally, numerous studies have imaged or quantified metabolites in various cell types (see, for example, Cutshaw et al., Chemical Reviews, 123(13), 8297–8346, 2023, for a comprehensive review). Extending these approaches to multiple omics layers in future studies would help resolve the issue raised by the reviewer.
(5) It is known that E. coli cells in different conditions have different cell sizes, where cell width increases with carbon source quality and growth rate. Is this effect normalized when processing the Raman signal?
Each spectrum was normalized by subtracting the average and dividing it by the standard deviation. This normalization minimizes the differences in signal intensities due to different cell sizes and densities. This information is shown in the Materials and Methods section of the Supplementary Materials.
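A minimal sketch of this per-spectrum normalization, assuming each spectrum is a one-dimensional array of intensities and that the population standard deviation is used (the exact convention is given in the Supplementary Materials, not here). The function name is hypothetical.

```python
import numpy as np

def normalize_spectrum(spectrum):
    """Subtract the mean intensity and divide by the standard deviation."""
    s = np.asarray(spectrum, dtype=float)
    return (s - s.mean()) / s.std()

# A spectrum and a copy with a different overall intensity and baseline,
# as might arise from a larger or denser cell (toy numbers).
spectrum = np.array([10.0, 12.0, 30.0, 11.0, 9.0])
brighter = 3.0 * spectrum + 5.0

print(np.allclose(normalize_spectrum(spectrum), normalize_spectrum(brighter)))
```

Because the normalization removes any constant offset and positive overall scale factor, two spectra that differ only in those respects become identical after processing.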
(6) I have a question about interpretation of the centrality index. A higher centrality indicates the protein expression pattern is more aligned with the "mainstream" of the other proteins in the proteome. However, it is possible that the proteome has multiple "mainstream modes" (with possibly different contributions in magnitudes), and the centrality seems to only capture the "primary mode". A small group of proteins could all have low centrality but have very consistent patterns with high conservation of stoichiometry. I am wondering if the authors could discuss and clarify this.
We thank the reviewer for drawing our attention to the insufficient explanation in the original manuscript. First, we note that stoichiometry conserving protein groups are not limited to those composed of proteins with high stoichiometry conservation centrality. The SCGs 2–5 are composed of proteins that strongly conserve stoichiometry within each group but have low stoichiometry conservation centrality (Fig. 5A, 5K, 5L, and 7A). In other words, our results demonstrate the existence of the "primary mainstream mode" (SCG 1, i.e., the homeostatic core) and condition-specific "non-primary mainstream modes" (SCGs 2–5). These primary and non-primary modes are distinguishable by their position along the axis of stoichiometry conservation centrality (Fig. 5A, 5K, and 5L).
However, a single one-dimensional axis (centrality) cannot capture all characteristics of stoichiometry-conserving architecture. In our case, the "non-primary mainstream modes" (SCGs 2–5) were distinguished from each other by multiple csLE axes.
To clarify this point, we modified the first paragraph of the section where we first introduce csLE (Revealing global stoichiometry conservation architecture of the proteomes with csLE). We also added a paragraph to the Discussion section regarding the condition-specific SCGs 2–5.
(7) Figures 3, 4, and 5A-I are analyses on proteome data and are not related to Raman spectral data. I am wondering if this part of the analysis can be re-organized and not disrupt the mainline of the manuscript.
We agree that the structure of this manuscript is complicated. Before submitting this manuscript to eLife, we seriously considered reorganizing it. However, we concluded that this structure was most appropriate because our focus on stoichiometry conservation cannot be explained without analyzing the coefficients of the Raman-proteome correspondence using COG classification (see Fig. 3; note that Fig. 3A relates to Raman data). This analysis led us to examine the global stoichiometry conservation architecture of proteomes (Figs. 4 and 5) and discover the unexpected similarity between the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>.
Therefore, we decided to keep the structure of the manuscript as it is. To partially resolve this issue, however, we added references to Fig. S1, the diagram of this paper’s mainline, to several places in the main text so that readers can more easily grasp the flow of the manuscript.
(8) Supplementary Equation (2.6) could be wrong. From my understanding of the coordinate transformation definition here, it should be [w1 ... ws] X := RHS terms in big parenthesis.
We checked the equation and confirmed that it is correct.
Reviewer #2 (Recommendations for the authors):
(1) The first main result, the linear map between Raman and proteome linked via B, is intriguing in the sense that the map is condition-independent. A speculative question I have is if this relationship may become more complex or have more condition-dependent corrections as the number of conditions goes up. The 15 or so conditions are great but it is not clear if they are quite restrictive. For example, they assume an abundance of most other nutrients. Now if you include a growth rate decrease due to nitrogen or other limitations, do you expect this to work?
In our previous paper (Kobayashi-Kirschvink et al., Cell Systems 7(1): 104–117.e4, 2018), we statistically demonstrated a linear correspondence between cellular Raman spectra and transcriptomes for fission yeast under 10 environmental conditions. These conditions included nutrient-rich and nutrient-limited conditions, such as nitrogen limitation. Since the Raman-transcriptome correspondence was only statistically verified in that study, we analyzed the data from the standpoint of stoichiometry conservation in this study. The results (Fig. S11 and S12) revealed a correspondence in lower dimensions similar to that observed in our main results. In addition, similar correspondences were obtained even for different E. coli strains under common culture conditions (Fig. S11 and S12). Therefore, it is plausible that the stoichiometry-conservation low-dimensional correspondence between Raman and gene expression profiles holds for a wide range of external and internal perturbations.
We agree with the reviewer that it is important to understand how Raman-omics correspondences change with the number of conditions. To address this issue, we examined how the correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> changes by subsampling the conditions used in the analysis. We focused on Θ, which was introduced in Fig. 5E, because the closeness of Θ to the identity matrix represents correspondence precision. We found a general trend that the low-dimensional correspondence becomes more precise as the number of conditions increases (Fig. S14). This suggests that increasing the number of conditions generally improves the correspondence rather than disrupting it.
We added a paragraph to the Discussion section addressing this important point. Please also refer to our response to Comment (3) of Reviewer #1 (Recommendations for the authors).
(2) A little more explanation in the text for 3C/D would help. I am imagining 3D is the control for 3C. Minor comment - 3B looks identical to S4F but the y-axis label is different.
We thank the reviewer for pointing out the insufficient explanation of Fig. 3C and 3D in the main text. Following this advice, we added explanations of these plots to the main text. We also added labels ("ISP COG class" and "non-ISP COG class") to the top of these two figures.
Fig. 3B and S4F are different. For simplicity, we used the Pearson correlation coefficient in Fig. 3B. However, cosine similarity is a more appropriate measure for evaluating the degree of conservation of abundance ratios. Thus, we presented the result using cosine similarity in a supplementary figure (Fig. S4F). Please note that each point in Fig. S4F is calculated between proteome vectors of two conditions. The dimension of each proteome vector is the number of genes in each COG class.
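The distinction between the two measures can be illustrated with a minimal sketch. The abundance values below are hypothetical and are not taken from the study; they only show that cosine similarity is sensitive to changes in abundance ratios that the Pearson correlation coefficient, which mean-centers each vector first, does not register.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two abundance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson_correlation(u, v):
    """Pearson correlation, i.e. cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])

# Hypothetical abundances of four proteins in one COG class (reference condition).
reference = [10.0, 20.0, 40.0, 80.0]
# Same ratios at twice the total abundance: stoichiometry is conserved.
scaled = [2.0 * a for a in reference]
# A constant added to every protein: ratios change (10:20 becomes 110:120).
shifted = [a + 100.0 for a in reference]

cos_scaled = cosine_similarity(reference, scaled)
cos_shifted = cosine_similarity(reference, shifted)
r_scaled = pearson_correlation(reference, scaled)
r_shifted = pearson_correlation(reference, shifted)
```

In this toy case both measures equal 1 for the proportionally scaled vector, whereas only cosine similarity drops below 1 for the shifted vector, whose abundance ratios are no longer conserved.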
(3) Can we see a log-log version of 4C to see how the low-abundance proteins are behaving? In fact, the same is in part true for Figure 3A.
We added the semi-log version of the graph for SCG1 (the homeostatic core) in Fig. 4C to make low-abundance proteins more visible. Please note that the growth rates under the two stationary-phase conditions were zero; therefore, plotting this graph in log-log format is not possible.
Fig. 3A cannot be shown as a log-log plot because many of the coefficients are negative. The insets in the graphs clarify the points near the origin.
(4) In 5L, how should one interpret the other dots that are close to the center but not part of the SCG1? And this theme continues in 6ACD and 7A.
The SCGs were obtained by setting a cosine similarity threshold. Therefore, proteins that are close to SCG 1 (the homeostatic core) but do not belong to it have a cosine similarity below the threshold with any protein in SCG 1. Fig. 7 illustrates the expression patterns of the proteins in question.
(5) Finally, I do not fully appreciate the whole analysis of connecting Ω<sub>csLE</sub> and Ω<sub>B</sub> and plots in 6 and 7. This corresponds to a lot of linear algebra in the 50 or so pages in section 1.8 in the supplementary. If the authors feel this is crucial in some way it needs to be better motivated and explained. I philosophically appreciate developing more formalism to establish these connections but I did not understand how this (maybe even if in the future) could lead to a new interpretation or analysis or theory.
The mathematical analyses included in the supplementary materials are important for readers who are interested in understanding the mathematics behind our conclusions. However, we also thought these arguments were too detailed for many readers when preparing the original submission and decided to show them in the supplemental materials.
To better explain the motivation behind the mathematical analyses, we revised the section “Representing the proteomes using the Raman LDA axes”.
Please also see our reply to the comment (6) by Reviewer #2 (Recommendations for the authors) below.
(6) Along the lines of the previous point, there seem to be two separate points being made: a) there is a correspondence between Raman and proteins, and b) we can use the protein data to look at centrality, generality, SCGs, etc. And the two don't seem to be linked until the formalism of the Ωs?
The reviewer is correct that we can calculate and analyze some of the quantities introduced in this study, such as stoichiometry conservation centrality and expression generality, without Raman data. However, it is difficult to justify introducing these quantities without analyzing the correspondence between the Raman and proteome profiles. Moreover, the definition of expression generality was derived from the analysis of Raman-proteome correspondence (see section 2.2 of the Supplementary Materials). Therefore, point b) cannot stand alone without point a) from its initial introduction.
To improve readability and partially address the complicated structure of this manuscript, we added references to Fig. S1, which is a diagram of the paper’s mainline, to several places in the main text. Please also see our reply to the comment (7) by Reviewer #1 (Recommendations for the authors).
eLife Assessment
The authors analyzed spectral properties of neural activity recorded using laminar probes while mice engaged in a global/local visual oddball paradigm. They found solid evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards a "predictive routing" scheme. The study is overall important because it addresses the basis of predictive processing in the cortex, but some of the analytical choices could be better motivated, and overall, the manuscript can be improved by performing additional analyses.
Reviewer #1 (Public review):
Summary:
The authors recorded neural activity using laminar probes while mice engaged in a global/local visual oddball paradigm. The article focuses on oscillatory activity, and the authors found activity differences in theta, alpha/beta, and gamma bands related to predictability and prediction error.
I think this is an important paper, providing more direct evidence for the role of signals in different frequency bands related to predictability and surprise in the sensory cortex.
Comments:
Below are some comments that may hopefully help further improve the quality of this already very interesting manuscript.
(1) Introduction:
The authors write in their introduction: "H1 further suggests a role for θ oscillations in prediction error processing as well." Without being fleshed out further, it is unclear what role this would be, or why. Could the authors expand this statement?
(2) Limited propagation of gamma band signals:
Some recent work (e.g. https://www.cell.com/cell-reports/fulltext/S2211-1247(23)00503-X) suggests that gamma-band signals reflect mainly entrainment of the fast-spiking interneurons, and don't propagate from V1 to downstream areas. Could the authors connect their findings to these emerging findings, suggesting no role for gamma-band activity in communication outside of the cortical column?
(3) Paradigm:
While I agree that the paradigm tests whether a specific type of temporal prediction can be formed, it is not a type of prediction that one would easily observe in mice, or even humans. The regularity that must be learned, in order to be able to see a reflection of predictability, integrates over 4 stimuli, each shown for 500 ms with a 500 ms blank in between (and a 1000 ms interval separating the 4th stimulus from the 1st stimulus of the next sequence). In other words, the mouse must keep in working memory three stimuli, which partly occurred more than a second ago, in order to correctly predict the fourth stimulus (and signal a 1000 ms interval as evidence for starting a new sequence).
A problem with this paradigm is that positive findings are easier to interpret than negative findings. If mice do not show a modulation to the global oddball, is it because "predictive coding" is the wrong hypothesis, or simply because the authors generated a design that operates outside of the boundary conditions of the theory? I think the latter is more plausible. Even in more complex animals, (eg monkeys or humans), I suspect that participants would have trouble picking up this regularity and sequence, unless it is directly task-relevant (which it is not, in the current setting). Previous experiments often used simple pairs (where transitional probability was varied, eg, Meyer and Olson, PNAS 2012) of stimuli that were presented within an intervening blank period. Clearly, these regularities would be a lot simpler to learn than the highly complex and temporally spread-out regularity used here, facilitating the interpretation of negative findings (especially in early cortical areas, which are known to have relatively small temporal receptive fields).
I am, of course, not asking the authors to redesign their study. I would like to ask them to discuss this caveat more clearly, in the Introduction and Discussion, and situate their design in the broader literature. For example, Jeff Gavornik has used much more rapid stimulus designs and observed clear modulations of spiking activity in early visual regions. I realize that this caveat may be more relevant for the spiking paper (which does not show any spiking activity modulation in V1 by global predictability) than for the current paper, but I still think it is an important general caveat to point out.
(4) Reporting of results:
I did not see any quantification of the strength of evidence of any of the results, beyond a general statement that all reported results pass significance at an alpha=0.01 threshold. It would be informative to know, for all reported results, what exactly the p-value of the significant cluster is; as well as for which performed tests there was no significant difference.
(5) Cluster test:
The authors use a three-dimensional cluster test, clustering across time, frequency, and location/channel. I am wondering how meaningful this analytical approach is. For example, there could be clusters that show an early difference at some location in low frequencies, and then a later difference in a different frequency band at another (adjacent) location. It seems a priori illogical to me to want to cluster across all these dimensions together, given that this kind of clustering does not appear neurophysiologically implausible/not meaningful. Can the authors motivate their choice of three-dimensional clustering, or better, facilitating interpretability, cluster eg at space and time within specific frequency bands (2d clustering)?
Reviewer #2 (Public review):
Summary:
Sennesh and colleagues analyzed LFP data from 6 regions of rodents while they were habituated to a stimulus sequence containing a local oddball (xxxy) and later exposed to either the same (xxxY) or a deviant global oddball (xxxX). Subsequently, they were exposed to a controlled random sequence (XXXY) or a controlled deterministic sequence (xxxx or yyyy). From these, the authors looked for differences in spectral properties (both oscillatory and aperiodic) between three contrasts (only for the last stimulus of the sequence).
(1) Deviance detection: unpredictable random (XXXY) versus predictable habituation (xxxy)
(2) Global oddball: unpredictable global oddball (xxxX) versus predictable deterministic (xxxx), and
(3) "Stimulus-specific adaptation:" locally unpredictable oddball (xxxY) versus predictable deterministic (yyyy).
They found evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards the "predictive routing" scheme.
While the dataset and analyses are well-suited to test evidence for predictive coding versus alternative hypotheses, I felt that the formulation was ambiguous, and the results were not very clear. My major concerns are as follows:
(1) The authors set up three competing hypotheses, in which H1 and H2 make directly opposite predictions. However, it must be noted that H2 is proposed for spatial prediction, where the predictability is computed from the part of the image outside the RF. This is different from the temporal prediction that is tested here. Evidence in favor of H2 is readily observed when large gratings are presented, for which there is substantially more gamma than in small images. Actually, there are multiple features in the spectral domain that should not be conflated, namely (i) the transient broadband response, which includes all frequencies, (ii) contribution from the evoked response (ERP), which is often in frequencies below 30 Hz, (iii) narrow-band gamma oscillations which are produced by large and continuous stimuli (which happen to be highly predictive), and (iv) sustained low-frequency rhythms in theta and alpha/beta bands which are prominent before stimulus onset and reduce after ~200 ms of stimulus onset. The authors should be careful to incorporate these in their formulation of PC, and in particular should not conflate narrow-band and broadband gamma.
(2) My understanding is that any aspect of predictive coding must be present before the onset of stimulus (expected or unexpected). So, I was surprised to see that the authors have shown the results only after stimulus onset. For all figures, the authors should show results from -500 ms to 500 ms instead of zero to 500 ms.
(3) In many cases, some change is observed in the initial ~100 ms of stimulus onset, especially for the alpha/beta and theta ranges. However, the evoked response contributes substantially in the transient period in these frequencies, and this evoked response could be different for different conditions. The authors should show the evoked responses to confirm the same, and if the claim really is that predictions are carried by genuine "oscillatory" activity, show the results after removing the ERP (as they had done for the CSD analysis).
(4) I was surprised by the statistics used in the plots. Anything that is even slightly positive or negative is turning out to be significant. Perhaps the authors could use a more stringent criterion for multiple comparisons?
(5) Since the design is blocked, there might be changes in global arousal levels. This is particularly important because the more predictive stimuli in the controlled deterministic stimuli were presented towards the end of the session, when the animal is likely less motivated. One idea to check for this is to do the analysis on the 3rd stimulus instead of the 4th? Any general effect of arousal/attention will be reflected in this stimulus.
(6) The authors should also acknowledge/discuss that typical stimulus presentation/attention modulation involves both (i) an increase in broadband power early on and (ii) a reduction in low-frequency alpha/beta power. This could be just a sensory response, without having a role in sending prediction signals per se. So the predictive routing hypothesis should involve testing for signatures of prediction while ruling out other confounds related to stimulus/cognition. It is, of course, very difficult to do so, but at the same time, simply showing a reduction in low-frequency power coupled with an increase in high-frequency power is not sufficient to prove PR.
(7) The CSD results need to be explained better - you should explain on what basis they are being called feedforward/feedback. Was LFP taken from Layer 4 LFP (as was done by van Kerkoerle et al, 2014)? The nice ">" and "<" CSD patterns (Figure 3B and 3F of their paper) in that paper are barely observed in this case, especially for the alpha/beta range.
(8) Figure 4a-c, I don't see a reduction in the broadband signal in a compared to b in the initial segment. Maybe change the clim to make this clearer?
(9) Figure 5 - please show the same for all three frequency ranges, show all bars (including the non-significant ones), and indicate the significance (p-values or by *, **, ***, etc) as done usually for bar plots.
(10) Their claim of alpha/beta oscillations being suppressed for unpredictable conditions is not as evident. A figure akin to Figure 5 would be helpful to see if this assertion holds.
(11) To investigate the prediction and violation or confirmation of expectation, it would help to look at both the baseline and stimulus periods in the analyses.
Reviewer #3 (Public review):
Summary:
In their manuscript entitled "Ubiquitous predictive processing in the spectral domain of sensory cortex", Sennesh and colleagues perform spectral analysis across multiple layers and areas in the visual system of mice. Their results are timely and interesting as they provide a complement to a study from the same lab focussed on firing rates, instead of oscillations. Together, the present study argues for a hypothesis called predictive routing, which argues that non-predictable stimuli are gated by Gamma oscillations, while alpha/beta oscillations are related to predictions.
Strengths:
(1) The study contains a clear introduction, which provides a clear contrast between a number of relevant theories in the field, including their hypotheses in relation to the present data set.
(2) The study provides a systematic analysis across multiple areas and layers of the visual cortex.
Weaknesses:
(1) It is claimed in the abstract that the present study supports predictive routing over predictive coding; however, this claim is nowhere in the manuscript directly substantiated. Not even the differences are clearly laid out, much less tested explicitly. While this might be obvious to the authors, it remains completely opaque to the reader, e.g., as it is also not part of the different hypotheses addressed. I guess this result is meant in contrast to reference 17, by some of the same authors, which argues against predictive coding, while the present work finds differences in the results, which they relate to spectral vs firing rate analysis (although without direct comparison).
(2) Most of the claims about a direction of propagation of certain frequency-related activities (made in the context of Figures 2-4) are - to the eyes of the reviewer - not supported by actual analysis but glimpsed from the pictures, sometimes, with very little evidence/very small time differences to go on. To keep these claims, proper statistical testing should be performed.
(3) Results from different areas are barely presented. While I can see that presenting them in the same format as Figures 2-4 would be quite lengthy, it might be a good idea to contrast the right columns (difference plots) across areas, rather than just the overall averages.
(4) Statistical testing is treated very generally, which can help to improve the readability of the text; however, in the present case, this is a bit extreme, with even obvious tests not reported or not even performed (in particular in Figure 5).
(5) The description of the analysis in the methods is rather short and, to my eye, was missing one of the key descriptions, i.e., how the CSD plots were baselined (which was hinted at in the results, but, as far as I know, not clearly described in the analysis methods). Maybe the authors could section the methods more to point out where this is discussed.
(6) While I appreciate the efforts of the authors to formulate their hypotheses and test them clearly, the text is quite dense at times. Partly this is due to the compared conditions in this paradigm; however, it would help a lot to show a visualization of what is being compared in Figures 2-4, rather than just showing the results.
Author response:
We would like to thank the three Reviewers for their thoughtful comments and detailed feedback. We are pleased to hear that the Reviewers found our paper to be “providing more direct evidence for the role of signals in different frequency bands related to predictability and surprise” (R1), “well-suited to test evidence for predictive coding versus alternative hypotheses” (R2), and “timely and interesting” (R3).
We perceive that the reviewers have an overall positive impression of the experiments and analyses, but find the text somewhat dense and would like to see additional statistical rigor, as well as, in some cases, additional analyses included in supplementary material. We therefore provide here a provisional letter addressing the revisions we have already performed and outlining, point by point, the revisions we are planning. We begin each enumerated point with the Reviewer’s quoted text, followed by our response.
Reviewer 1:
(1) Introduction:
The authors write in their introduction: "H1 further suggests a role for θ oscillations in prediction error processing as well." Without being fleshed out further, it is unclear what role this would be, or why. Could the authors expand this statement?
We have edited the text to indicate that theta-band activity has been related to prediction error processing as an empirical observation, and must regrettably leave drawing inferences about its functional role to future work, with experiments designed specifically to draw out theta-band activity.
(2) Limited propagation of gamma band signals:
Some recent work (e.g. https://www.cell.com/cell-reports/fulltext/S2211-1247(23)00503-X) suggests that gamma-band signals reflect mainly entrainment of the fast-spiking interneurons, and don't propagate from V1 to downstream areas. Could the authors connect their findings to these emerging findings, suggesting no role for gamma-band activity in communication outside of the cortical column?
We have not specifically claimed that gamma propagates between columns/areas in our recordings, only that it synchronizes synaptic current flows between laminar layers within a column/area. We nonetheless suggest that gamma can locally synchronize a column, and potentially local columns within an area via entrainment of local recurrent spiking, to update an internal prediction/representation upon onset of a prediction error. We also point the Reviewer to our Discussion section, where we state that our results fit with a model “whereby θ oscillations synchronize distant areas, enabling them to exchange relevant signals during cognitive processing.” In our present work, we therefore remain agnostic about whether theta or gamma or both (or alternative mechanisms) are at play in terms of how prediction error signals are transmitted between areas.
(3) Paradigm:
While I agree that the paradigm tests whether a specific type of temporal prediction can be formed, it is not a type of prediction that one would easily observe in mice, or even humans. The regularity that must be learned, in order to be able to see a reflection of predictability, integrates over 4 stimuli, each shown for 500 ms with a 500 ms blank in between (and a 1000 ms interval separating the 4th stimulus from the 1st stimulus of the next sequence). In other words, the mouse must keep in working memory three stimuli, which partly occurred more than a second ago, in order to correctly predict the fourth stimulus (and signal a 1000 ms interval as evidence for starting a new sequence).
A problem with this paradigm is that positive findings are easier to interpret than negative findings. If mice do not show a modulation to the global oddball, is it because "predictive coding" is the wrong hypothesis, or simply because the authors generated a design that operates outside of the boundary conditions of the theory? I think the latter is more plausible. Even in more complex animals, (eg monkeys or humans), I suspect that participants would have trouble picking up this regularity and sequence, unless it is directly task-relevant (which it is not, in the current setting). Previous experiments often used simple pairs (where transitional probability was varied, eg, Meyer and Olson, PNAS 2012) of stimuli that were presented within an intervening blank period. Clearly, these regularities would be a lot simpler to learn than the highly complex and temporally spread-out regularity used here, facilitating the interpretation of negative findings (especially in early cortical areas, which are known to have relatively small temporal receptive fields).
I am, of course, not asking the authors to redesign their study. I would like to ask them to discuss this caveat more clearly, in the Introduction and Discussion, and situate their design in the broader literature. For example, Jeff Gavornik has used much more rapid stimulus designs and observed clear modulations of spiking activity in early visual regions. I realize that this caveat may be more relevant for the spiking paper (which does not show any spiking activity modulation in V1 by global predictability) than for the current paper, but I still think it is an important general caveat to point out.
We appreciate the Reviewer’s concern about working memory limitations in mice. Our paradigm and training followed on from previous paradigms such as Gavornik and Bear (2014), in which predictive effects were observed in mouse V1 with presentation times of 150 ms and interstimulus intervals of 1500 ms. In addition, we note that Jamali et al. (2024) recently utilized a similar global/local paradigm in the auditory domain with inter-sequence intervals as long as 28-30 seconds, and still observed effects of a predicted sequence (https://elifesciences.org/articles/102702). For the revised manuscript, we plan to expand on this in the Discussion section.
That being said, as the Reviewer also pointed out, this would be a greater concern had we not found any positive findings in our study. However, even with the rather long sequence periods we used, we did find positive evidence for predictive effects, supporting the use of our current paradigm. We agree with the reviewer that these positive effects are easier to interpret than negative effects, and plan to expand upon this in the Discussion when we resubmit.
(4) Reporting of results:
I did not see any quantification of the strength of evidence of any of the results, beyond a general statement that all reported results pass significance at an alpha=0.01 threshold. It would be informative to know, for all reported results, what exactly the p-value of the significant cluster is; as well as for which performed tests there was no significant difference.
For the revised manuscript, we can include the p-values after cluster-based testing for each significant cluster, as well as show data that passes a more stringent threshold of p<0.001 (1/1000) or p<0.005 (1/200) rather than our present p<0.01 (1/100).
(5) Cluster test:
The authors use a three-dimensional cluster test, clustering across time, frequency, and location/channel. I am wondering how meaningful this analytical approach is. For example, there could be clusters that show an early difference at some location in low frequencies, and then a later difference in a different frequency band at another (adjacent) location. It seems a priori illogical to me to want to cluster across all these dimensions together, given that this kind of clustering does not appear neurophysiologically implausible/not meaningful. Can the authors motivate their choice of three-dimensional clustering, or better, facilitating interpretability, cluster eg at space and time within specific frequency bands (2d clustering)?
We are happy to include a 3D plot of a time-channel-frequency cluster in the revised manuscript to clarify our statistical approach for the reviewer. We consider our current three-dimensional cluster-testing an “unsupervised” way of uncovering significant contrasts with no theory-driven assumptions about which bounded frequency bands or layers do what.
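The difference between three-dimensional and band-wise two-dimensional clustering can be made concrete with a small sketch of the cluster-forming step alone. Everything below is an assumption for illustration: the toy statistic map, the threshold, face connectivity, and the use of summed statistics as cluster mass are not taken from the manuscript, and the permutation step that assigns p-values is omitted.

```python
from collections import deque

def find_clusters(stat, threshold):
    """Group supra-threshold points of a 3D statistic map into face-connected clusters.

    stat[t][c][f] is a test statistic at time bin t, channel c, frequency bin f.
    Returns a list of (cluster_mass, points), where points are (t, c, f) tuples.
    """
    n_t, n_c, n_f = len(stat), len(stat[0]), len(stat[0][0])
    visited = set()
    clusters = []
    # Neighbors differ by one step along exactly one of the three axes.
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for t in range(n_t):
        for c in range(n_c):
            for f in range(n_f):
                if (t, c, f) in visited or stat[t][c][f] <= threshold:
                    continue
                # Breadth-first search over connected supra-threshold points.
                queue = deque([(t, c, f)])
                visited.add((t, c, f))
                points = []
                while queue:
                    p = queue.popleft()
                    points.append(p)
                    for dt, dc, df in steps:
                        q = (p[0] + dt, p[1] + dc, p[2] + df)
                        inside = 0 <= q[0] < n_t and 0 <= q[1] < n_c and 0 <= q[2] < n_f
                        if inside and q not in visited and stat[q[0]][q[1]][q[2]] > threshold:
                            visited.add(q)
                            queue.append(q)
                mass = sum(stat[i][j][k] for i, j, k in points)
                clusters.append((mass, points))
    return clusters

def slice_frequency(stat, f):
    """Keep a single frequency bin, giving a time x channel x 1 map (2D clustering)."""
    return [[[row[f]] for row in plane] for plane in stat]

# Toy statistic map: 4 time bins x 3 channels x 4 frequency bins, all zeros.
stat = [[[0.0] * 4 for _ in range(3)] for _ in range(4)]
# An early low-frequency effect on channel 0 ...
stat[0][0][0] = 3.0
stat[1][0][0] = 3.0
# ... bridging to a later, higher-frequency effect on the adjacent channel 1.
stat[1][1][0] = 3.0
stat[1][1][1] = 3.0
stat[2][1][1] = 3.0
# An isolated point elsewhere forms its own cluster.
stat[3][2][3] = 4.0

clusters_3d = find_clusters(stat, threshold=2.0)
clusters_f0 = find_clusters(slice_frequency(stat, 0), threshold=2.0)
clusters_f1 = find_clusters(slice_frequency(stat, 1), threshold=2.0)
```

In this toy map the three-dimensional procedure merges the early low-frequency effect and the later higher-frequency effect into one cluster, while clustering within each frequency bin keeps them apart. Which behavior is preferable depends on whether effects are expected to be contiguous across frequency.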
Reviewer 2:
Sennesh and colleagues analyzed LFP data from 6 regions of rodents while they were habituated to a stimulus sequence containing a local oddball (xxxy) and later exposed to either the same (xxxY) or a deviant global oddball (xxxX). Subsequently, they were exposed to a controlled random sequence (XXXY) or a controlled deterministic sequence (xxxx or yyyy). From these, the authors looked for differences in spectral properties (both oscillatory and aperiodic) between three contrasts (only for the last stimulus of the sequence).
(1) Deviance detection: unpredictable random (XXXY) versus predictable habituation (xxxy)
(2) Global oddball: unpredictable global oddball (xxxX) versus predictable deterministic (xxxx), and
(3) "Stimulus-specific adaptation:" locally unpredictable oddball (xxxY) versus predictable deterministic (yyyy).
They found evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards the "predictive routing" scheme.
While the dataset and analyses are well-suited to test evidence for predictive coding versus alternative hypotheses, I felt that the formulation was ambiguous, and the results were not very clear. My major concerns are as follows:
We appreciate the reviewer’s concerns and outline how we will address them below:
(1) The authors set up three competing hypotheses, in which H1 and H2 make directly opposite predictions. However, it must be noted that H2 is proposed for spatial prediction, where the predictability is computed from the part of the image outside the RF. This is different from the temporal prediction that is tested here. Evidence in favor of H2 is readily observed when large gratings are presented, for which there is substantially more gamma than in small images. Actually, there are multiple features in the spectral domain that should not be conflated, namely (i) the transient broadband response, which includes all frequencies, (ii) contribution from the evoked response (ERP), which is often in frequencies below 30 Hz, (iii) narrow-band gamma oscillations which are produced by large and continuous stimuli (which happen to be highly predictive), and (iv) sustained low-frequency rhythms in theta and alpha/beta bands which are prominent before stimulus onset and reduce after ~200 ms of stimulus onset. The authors should be careful to incorporate these in their formulation of PC, and in particular should not conflate narrow-band and broadband gamma.
We have clarified in the manuscript that while the gamma-as-prediction hypothesis (our H2) was originally proposed in a spatial prediction domain, further work (specifically Singer (2021)) has extended the hypothesis to cover temporal-domain predictions as well.
To address the reviewer’s point about multiple features in the spectral domain: Our analysis has specifically separated aperiodic components using FOOOF analysis (Supp. Fig. 1) and explicitly fit and tested aperiodic vs. periodic components (Supp. Figs 1&2). We did not find strong effects in the aperiodic components but did in the periodic components (Supp. Fig. 2), allowing us to be more confident in our conclusions in terms of genuine narrow-band oscillations. In the revised manuscript, we will include analysis of the pre-stimulus time window to address the reviewer’s point (iv) on sustained low frequency oscillations.
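The idea of separating aperiodic from periodic spectral components can be sketched in a few lines. This is a simplified stand-in and not the FOOOF algorithm itself: it fits a straight line to the log-log spectrum outside an assumed peak band and treats the residual as the periodic part. The synthetic spectrum, the 25-55 Hz exclusion band, and all parameter values are hypothetical.

```python
import math

def fit_aperiodic(freqs, power):
    """Least-squares line through (log10 f, log10 power).

    Returns (offset, exponent) for the model log10 P(f) = offset - exponent * log10 f.
    """
    x = [math.log10(f) for f in freqs]
    y = [math.log10(p) for p in power]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope

# Synthetic spectrum (hypothetical): 1/f^1.5 background plus a narrow peak at 40 Hz.
freqs = [float(f) for f in range(2, 101)]
true_offset, true_exponent = 2.0, 1.5
background = [10 ** true_offset / f ** true_exponent for f in freqs]
# Gaussian bump in log10 power, centered at 40 Hz with a 3 Hz standard deviation.
bump = [0.5 * math.exp(-((f - 40.0) ** 2) / (2 * 3.0 ** 2)) for f in freqs]
spectrum = [b * 10 ** g for b, g in zip(background, bump)]

# Fit the aperiodic line only where the assumed peak band (25-55 Hz) is excluded.
outside = [(f, p) for f, p in zip(freqs, spectrum) if abs(f - 40.0) > 15.0]
offset_hat, exponent_hat = fit_aperiodic([f for f, _ in outside], [p for _, p in outside])

# Periodic residual: log power minus the fitted aperiodic line.
residual = [math.log10(p) - (offset_hat - exponent_hat * math.log10(f))
            for f, p in zip(freqs, spectrum)]
peak_freq = freqs[residual.index(max(residual))]
```

A broadband change would show up as a change in the fitted offset or exponent, whereas a genuine narrow-band oscillation appears as a localized peak in the residual.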
(2) My understanding is that any aspect of predictive coding must be present before the onset of stimulus (expected or unexpected). So, I was surprised to see that the authors have shown the results only after stimulus onset. For all figures, the authors should show results from -500 ms to 500 ms instead of zero to 500 ms.
In our revised manuscript we will include a pre-stimulus analysis and supplementary figures with time ranges from -500 ms to 500 ms. We refrained from doing so in the initial manuscript only because our paradigm’s short interstimulus interval makes it difficult to interpret whether activity in the ISI reflects post-stimulus dynamics or pre-stimulus prediction. Nonetheless, we can easily show that in our paradigm, alpha/beta-band activity is elevated in the interstimulus interval after the offset of the previous stimulus, assuming that we baseline to the pre-trial period.
(3) In many cases, some change is observed in the initial ~100 ms of stimulus onset, especially for the alpha/beta and theta ranges. However, the evoked response contributes substantially in the transient period in these frequencies, and this evoked response could be different for different conditions. The authors should show the evoked responses to confirm the same, and if the claim really is that predictions are carried by genuine "oscillatory" activity, show the results after removing the ERP (as they had done for the CSD analysis).
We have included an extra sentence in our Materials and Methods section clarifying that the evoked potential/ERP was removed in our existing analyses, prior to performing the spectral decomposition of the LFP signal. We also note that the FOOOF analysis we applied separates aperiodic components of the spectral signal from the strictly oscillatory ones.
In our revised manuscript we will include an analysis of the evoked responses as suggested by the reviewer.
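As an illustration of what removing the evoked potential means in practice, here is a minimal sketch under stated assumptions: the ERP is estimated as the across-trial mean for one channel and condition, and is subtracted from every trial before any spectral decomposition. The sampling rate, the simulated signals, and the function name are hypothetical and do not describe our actual pipeline.

```python
import numpy as np

def remove_evoked(trials):
    """Subtract the across-trial mean (the ERP estimate) from every trial.
    trials: array of shape (n_trials, n_samples), one channel and condition.
    Returns (residual_trials, erp)."""
    erp = trials.mean(axis=0, keepdims=True)
    return trials - erp, erp[0]

rng = np.random.default_rng(0)
n_trials, n_samples = 50, 500
t = np.arange(n_samples) / 1000.0  # seconds, assuming 1 kHz sampling

# Phase-locked deflection that is identical on every trial (the ERP).
evoked = np.exp(-((t - 0.1) ** 2) / (2 * 0.02 ** 2))
# Induced 40 Hz activity whose phase varies randomly across trials.
phases = rng.uniform(0, 2 * np.pi, size=(n_trials, 1))
induced = 0.5 * np.sin(2 * np.pi * 40 * t + phases)
trials = evoked + induced + 0.1 * rng.standard_normal((n_trials, n_samples))

residual, erp = remove_evoked(trials)
```

Because the induced oscillation is not phase-locked, it largely survives the subtraction, while the phase-locked deflection is removed. Spectral estimates computed on `residual` are therefore less contaminated by the transient evoked response.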
(4) I was surprised by the statistics used in the plots. Anything that is even slightly positive or negative is turning out to be significant. Perhaps the authors could use a more stringent criterion for multiple comparisons?
As noted above to Reviewer 1 (point 4), we are happy to include supplemental figures in our resubmission showing the effects on our results of setting the statistical significance threshold with considerably greater stringency.
(5) Since the design is blocked, there might be changes in global arousal levels. This is particularly important because the more predictive stimuli in the controlled deterministic stimuli were presented towards the end of the session, when the animal is likely less motivated. One idea to check for this is to do the analysis on the 3rd stimulus instead of the 4th? Any general effect of arousal/attention will be reflected in this stimulus.
In order to check for the brain-wide effects of arousal, we plan to perform similar analyses to our existing ones on the 3rd stimulus in each block, rather than just the 4th “oddball” stimulus. Clusters that appear significantly contrasting in both the 3rd and 4th stimuli may be attributable to arousal. We will also analyze pupil size as an index of arousal to check for arousal differences between conditions in our contrasts, possibly stratifying our data before performing comparisons to equalize pupil size within contrasts. We plan to include these analyses in our resubmission.
(6) The authors should also acknowledge/discuss that typical stimulus presentation/attention modulation involves both (i) an increase in broadband power early on and (ii) a reduction in low-frequency alpha/beta power. This could be just a sensory response, without having a role in sending prediction signals per se. So the predictive routing hypothesis should involve testing for signatures of prediction while ruling out other confounds related to stimulus/cognition. It is, of course, very difficult to do so, but at the same time, simply showing a reduction in low-frequency power coupled with an increase in high-frequency power is not sufficient to prove PR.
Since many different predictive coding and predictive processing hypotheses make very different claims about how predictions might be encoded in neurophysiological recordings, we have focused on prediction error encoding in this paper.
For the hypothesis space we have considered (H1-H3), each hypothesis makes clearly distinguishable predictions about the spectral response during the time period in the task when prediction errors should be present. As noted by the reviewer, a transient increase in broadband frequencies would be a signature of H3. Changes to oscillatory power in the gamma band in distinct directions (e.g., increasing or decreasing with prediction error) would support either H1 or H2, depending on the direction of change. We believe our data, especially our use of FOOOF analysis and separation of periodic from aperiodic components, coupled to the three experimental contrasts, speak clearly in favor of the Predictive Routing model, but we do not claim we have “proved” it. This study provides just one datapoint, and we will acknowledge this in our revised Discussion in our resubmission.
(7) The CSD results need to be explained better - you should explain on what basis they are being called feedforward/feedback. Was LFP taken from Layer 4 LFP (as was done by van Kerkoerle et al, 2014)? The nice ">" and "<" CSD patterns (Figure 3B and 3F of their paper) in that paper are barely observed in this case, especially for the alpha/beta range.
We consider a feedforward pattern as flowing from L4 outwards to L2/3 and L5/6, and a feedback pattern as flowing in the opposite direction, from L1 and L6 to the middle layers. We will clarify this in the revised manuscript.
Since gamma-band oscillations are strongest in L2/3, we re-epoched LFPs to the oscillation troughs in L2/3 in the initial manuscript. We can include in the revised manuscript equivalent plots after finding oscillation troughs in L4 instead, as well as calculating the difference in trough times within-band between layers to quantify the transmission delay and add additional rigor to our feedforward vs. feedback interpretation of the CSD data.
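For readers less familiar with CSD, the sketch below shows the standard second-spatial-difference estimate across equally spaced laminar contacts. It is a generic illustration, not our analysis code: the conductivity value, the contact spacing, and the function name are assumptions, and the trough-triggered re-epoching described above is not shown.

```python
import numpy as np

def csd_second_difference(lfp, spacing_m, conductivity=0.3):
    """Estimate CSD as -sigma * d^2V/dz^2 using a three-point stencil.
    lfp: (n_channels, n_samples) in volts, channels ordered by depth.
    spacing_m: distance between adjacent contacts in metres (assumed equal).
    conductivity: assumed tissue conductivity in S/m.
    Returns an array of shape (n_channels - 2, n_samples)."""
    second_diff = lfp[2:] - 2.0 * lfp[1:-1] + lfp[:-2]
    return -conductivity * second_diff / spacing_m ** 2

# Hypothetical probe: 8 contacts, 100 micrometre spacing, 5 time samples.
spacing = 1e-4
depth = np.arange(8) * spacing
quadratic_lfp = (depth ** 2)[:, None] * np.ones((1, 5))  # V(z) = z^2
linear_lfp = depth[:, None] * np.ones((1, 5))            # V(z) = z

csd_quadratic = csd_second_difference(quadratic_lfp, spacing)
csd_linear = csd_second_difference(linear_lfp, spacing)
```

A potential that varies linearly with depth yields zero CSD, while curvature in the depth profile yields a sink or source. The estimate is undefined at the outermost contacts, which is why two channels are lost.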
(8) Figure 4a-c, I don't see a reduction in the broadband signal in a compared to b in the initial segment. Maybe change the clim to make this clearer?
We are looking into the clim/colorbar and plot-generation code to figure out the visibility issue that the Reviewer has kindly pointed out to us.
(9) Figure 5 - please show the same for all three frequency ranges, show all bars (including the non-significant ones), and indicate the significance (p-values or by *, **, ***, etc) as done usually for bar plots.
We will add the requested bar-plots for all frequency ranges, though we note that the bars given here are the results of adding up the spectral power in the channel-time-frequency clusters that already passed significance tests and that adding secondary significance tests here may not prove informative.
(10) Their claim of alpha/beta oscillations being suppressed for unpredictable conditions is not as evident. A figure akin to Figure 5 would be helpful to see if this assertion holds.
As noted above, we will include the requested bar plot, as well as examining alpha/beta in the pre-stimulus time-series rather than after the onset of the oddball stimulus.
(11) To investigate the prediction and violation or confirmation of expectation, it would help to look at both the baseline and stimulus periods in the analyses.
We will include for the Reviewer’s edification a supplementary figure showing the spectrograms for the baseline and full-trial periods to look at the difference between baseline and prestimulus expectation.
Reviewer 3:
Summary:
In their manuscript entitled "Ubiquitous predictive processing in the spectral domain of sensory cortex", Sennesh and colleagues perform spectral analysis across multiple layers and areas in the visual system of mice. Their results are timely and interesting as they provide a complement to a study from the same lab focussed on firing rates, instead of oscillations. Together, the present study argues for a hypothesis called predictive routing, which argues that non-predictable stimuli are gated by Gamma oscillations, while alpha/beta oscillations are related to predictions.
Strengths:
(1) The study contains a clear introduction, which provides a clear contrast between a number of relevant theories in the field, including their hypotheses in relation to the present data set.
(2) The study provides a systematic analysis across multiple areas and layers of the visual cortex.
We thank the Reviewer for their kind comments.
Weaknesses:
(1) It is claimed in the abstract that the present study supports predictive routing over predictive coding; however, this claim is nowhere in the manuscript directly substantiated. Not even the differences are clearly laid out, much less tested explicitly. While this might be obvious to the authors, it remains completely opaque to the reader, e.g., as it is also not part of the different hypotheses addressed. I guess this result is meant in contrast to reference 17, by some of the same authors, which argues against predictive coding, while the present work finds differences in the results, which they relate to spectral vs firing rate analysis (although without direct comparison).
We agree that in this manuscript we should restrict ourselves to the hypotheses that were directly tested. We have revised our abstract accordingly, and softened our claim to note only that our LFP results are compatible with predictive routing.
(2) Most of the claims about a direction of propagation of certain frequency-related activities (made in the context of Figures 2-4) are - to the eyes of the reviewer - not supported by actual analysis but glimpsed from the pictures, sometimes, with very little evidence/very small time differences to go on. To keep these claims, proper statistical testing should be performed.
In our revised manuscript, we will either substantiate (with quantification of CSD delays between layers) or soften the claims about feedforward/feedback direction of flow within the cortical column.
(3) Results from different areas are barely presented. While I can see that presenting them in the same format as Figures 2-4 would be quite lengthy, it might be a good idea to contrast the right columns (difference plots) across areas, rather than just the overall averages.
In our revised manuscript we will gladly include a supplementary figure showing the right-column difference plots across areas, in order to make sure to include aspects of our dataset that span up and down the cortical hierarchy.
(4) Statistical testing is treated very generally, which can help to improve the readability of the text; however, in the present case, this is a bit extreme, with even obvious tests not reported or not even performed (in particular in Figure 5).
We appreciate the Reviewer’s concern for statistical rigor, and as noted to the other reviewers, we can add different levels of statistical description and describe the p-values associated with specific clusters. Regarding Figure 5, we must protest, as the bar heights were computed from clusters already subjected to statistical testing and found significant. We could add a supplementary figure which considers untested narrowband activity and tests it only in the “bar height” domain, if the Reviewer would like.
(5) The description of the analysis in the methods is rather short and, to my eye, was missing one of the key descriptions, i.e., how the CSD plots were baselined (which was hinted at in the results, but, as far as I know, not clearly described in the analysis methods). Maybe the authors could section the methods more to point out where this is discussed.
We have added some elaboration to our Materials and Methods section, especially to specify that CSD, having physical rather than arbitrary units, does not require baselining.
(6) While I appreciate the efforts of the authors to formulate their hypotheses and test them clearly, the text is quite dense at times. Partly this is due to the compared conditions in this paradigm; however, it would help a lot to show a visualization of what is being compared in Figures 2-4, rather than just showing the results.
In the revised manuscript we will add a visual aid for the three contrasts we consider.
We are happy to inform the editors that we have implemented, for the Reviewed Preprint, the direct textual Recommendations for the Authors given by Reviewers 2 and 3. We will implement the suggested Figure changes in our revised manuscript. We thank them for their feedback in strengthening our manuscript.
eLife Assessment
Mark and colleagues developed and validated a valuable method for examining subspace generalization in fMRI data and applied it to understand whether the entorhinal cortex uses abstract representations that generalize across different environments with the same structure. The manuscript presents convincing evidence for the conclusion that abstract entorhinal representations of hexagonal associative structures generalize across different stimulus sets.
Reviewer #1 (Public review):
Summary:
This study develops and validates a neural subspace similarity analysis for testing whether neural representations of graph structures generalize across graph size and stimulus sets. The authors show the method works in rat grid and place cell data, finding that grid but not place cells generalize across different environments, as expected. The authors then perform additional analyses and simulations to show that this method should also work on fMRI data. Finally, the authors test their method on fMRI responses from entorhinal cortex (EC) in a task that involves graphs that vary in size (and stimulus set) and statistical structure (hexagonal and community). They find neural representations of stimulus sets in lateral occipital complex (LOC) generalize across statistical structure and that EC activity generalizes across stimulus sets/graph size, but only for the hexagonal structures.
Strengths:
(1) The overall topic is very interesting and timely and the manuscript is well written.
(2) The method is clever and powerful. It could be important for future research testing whether neural representations are aligned across problems with different state manifestations.
(3) The findings provide new insights into generalizable neural representations of abstract task states in entorhinal cortex.
Weaknesses:
(1) There are two design confounds that are not sufficiently discussed.
(1.1) First, hexagonal and community structures are confounded by training order. All subjects learned the hexagonal graph always before the community graph. As such, any differences between the two graphs could be explained (in theory) by order effects (although this is unlikely). However, because community and hexagonal structures shared the same stimuli, it is possible that subjects had to find ways to represent the community structures separately from the hexagonal structures. This could potentially explain why there was no generalization across graph size for community structures.
(1.2) Second, subjects had more experience with the hexagonal and community structures before and during fMRI scanning. This is another possible reason why there was no generalization for the community structure.
(2) The authors include the results from a searchlight analysis to show specificity of the effects for EC. A more convincing way (in my opinion) to show specificity would be to test for (and report the results of) a double dissociation between the visual and structural contrast in two independently defined regions (e.g., anatomical ROIs of LOC and EC). This would substantiate the point that EC activity generalizes across structural similarity while sensory regions like LOC generalize across visual similarity.
Reviewer #2 (Public review):
Summary:
Mark and colleagues test the hypothesis that entorhinal cortical representations may contain abstract structural information that facilitates generalization across structurally similar contexts. To do so, they use a method called "subspace generalization" designed to measure abstraction of representations across different settings. The authors validate the method using hippocampal place cells and entorhinal grid cells recorded in a spatial task, then perform simulations that support that it might be useful in aggregated responses such as those measured with fMRI. Then the method is applied to fMRI data that required participants to learn relationships between images in one of two structural motifs (hexagonal grids versus community structure). They show that the BOLD signal within an entorhinal ROI shows increased measures of subspace generalization across different tasks with the same hexagonal structure (as compared to tasks with different structures) but that there was no evidence for the complementary result (i.e., increased generalization across tasks that share community structure, as compared to those with different structures). Taken together, this manuscript describes and validates a method for identifying fMRI representations that generalize across conditions and applies it to reveal entorhinal representations that emerge across specific shared structural conditions.
Strengths:
I found this paper interesting both in terms of its methods and its motivating questions. The question asked is novel and the methods employed are new - and I believe this is the first time that they have been applied to fMRI data. I also found the iterative validation of the methodology to be interesting and important - showing persuasively that the method could detect a target representation - even in the face of random combination of tuning and with the addition of noise, both being major hurdles to investigating representations using fMRI.
Weaknesses:
The primary weakness of the paper in terms of empirical results is that the representations identified in EC had no clear relationship to behavior, raising questions about their functional importance.
The method developed is a clearly valuable tool that can serve as part of a larger battery of analysis techniques, but a small weakness on the methodological side is that for a given dataset, it might be hard to determine whether the method developed here would be better or worse than alternative methods.
Reviewer #3 (Public review):
Summary:
The article explores the brain's ability to generalize information, with a specific focus on the entorhinal cortex (EC) and its role in learning and representing structural regularities that define relationships between entities in networks. The research provides empirical support for the longstanding theoretical and computational neuroscience hypothesis that the EC is crucial for structure generalization. It demonstrates that EC codes can generalize across non-spatial tasks that share common structural regularities, regardless of the similarity of sensory stimuli and network size.
Strengths:
At first glance, a potential limitation of this study appears to be its application of analytical methods originally developed for high-resolution animal electrophysiology (Samborska et al., 2022) to the relatively coarse and noisy signals of human fMRI. Rather than sidestepping this issue, however, the authors embrace it as a methodological challenge. They provide compelling empirical evidence and biologically grounded simulations to show that key generalization properties of entorhinal cortex representations can still be robustly detected. This not only validates their approach but also demonstrates how far non-invasive human neuroimaging can be pushed. The use of multiple independent datasets and carefully controlled permutation tests further underscores the reliability of their findings, making a strong case that structural generalization across diverse task environments can be meaningfully studied even in abstract, non-spatial domains that are otherwise difficult to investigate in animal models.
Weaknesses:
While this study provides compelling evidence for structural generalization in the entorhinal cortex (EC), several limitations remain that pave the way for promising future research. One issue is that the generalization effect was statistically robust in only one task condition, with weaker effects observed in the "community" condition. This raises the question of whether the null result genuinely reflects a lack of EC involvement, or whether it might be attributable to other factors such as task complexity, training order, or insufficient exposure, possibilities that the authors acknowledge as open questions. Moreover, although the study leverages fMRI to examine EC representations in humans, it does not clarify which specific components of EC coding (such as grid cells versus other spatially tuned but non-grid codes) underlie the observed generalization. While electrophysiological data in animals have begun to address this, the human experiments do not disentangle the contributions of these different coding types. This leaves unresolved the important question of what makes EC representations uniquely suited for generalization, particularly given that similar effects were not observed in other regions known to contain grid cells, such as the medial prefrontal cortex (mPFC) or posterior cingulate cortex (PCC). These limitations point to important future directions for better characterizing the computational role of the EC and its distinctiveness within the broader network supporting learning and decision making based on cognitive maps.
Author response:
The following is the authors’ response to the original reviews
Public Reviews:
Reviewer #1 (Public review):
Summary:
This study develops and validates a neural subspace similarity analysis for testing whether neural representations of graph structures generalize across graph size and stimulus sets. The authors show the method works in rat grid and place cell data, finding that grid but not place cells generalize across different environments, as expected. The authors then perform additional analyses and simulations to show that this method should also work on fMRI data. Finally, the authors test their method on fMRI responses from the entorhinal cortex (EC) in a task that involves graphs that vary in size (and stimulus set) and statistical structure (hexagonal and community). They find neural representations of stimulus sets in lateral occipital complex (LOC) generalize across statistical structure and that EC activity generalizes across stimulus sets/graph size, but only for the hexagonal structures.
Strengths:
(1) The overall topic is very interesting and timely and the manuscript is well-written.
(2) The method is clever and powerful. It could be important for future research testing whether neural representations are aligned across problems with different state manifestations.
(3) The findings provide new insights into generalizable neural representations of abstract task states in the entorhinal cortex.
We thank the reviewer for their kind comments and clear summary of the paper and its strengths.
Weaknesses:
(1) The manuscript would benefit from improving the figures. Moreover, the clarity could be strengthened by including conceptual/schematic figures illustrating the logic and steps of the method early in the paper. This could be combined with an illustration of the remapping properties of grid and place cells and how the method captures these properties.
We agree with the reviewer and have added a schematic figure of the method (figure 1a).
(2) Hexagonal and community structures appear to be confounded by training order. All subjects learned the hexagonal graph always before the community graph. As such, any differences between the two graphs could thus be explained (in theory) by order effects (although this is practically unlikely). However, given community and hexagonal structures shared the same stimuli, it is possible that subjects had to find ways to represent the community structures separately from the hexagonal structures. This could potentially explain why the authors did not find generalizations across graph sizes for community structures.
We thank the reviewer for their comments. We agree that the null result regarding the community structures does not mean that EC doesn’t generalise over these structures, and that the training order could in theory contribute to the lack of an effect. The decision to keep the asymmetry of the training order was deliberate: we chose this order based on our previous study (Mark et al. 2020), where we show that learning a community structure first changes the learning strategy of subsequent graphs. We could have perhaps overcome this by increasing the training periods, but 1) the training period is already very long; 2) there will still be asymmetry because the group that first learns the community structure will struggle in learning the hexagonal graph more than vice versa, as shown in Mark et al. 2020.
We have added the following sentences on this decision to the Methods section:
“We chose to first teach hexagonal graphs for all participants and not randomize the order because of previous results showing that first learning community structure changes participants’ learning strategy (Mark et al. 2020).”
(3) The authors include the results from a searchlight analysis to show the specificity of the effects of EC. A better way to show specificity would be to test for a double dissociation between the visual and structural contrast in two independently defined regions (e.g., anatomical ROIs of LOC and EC).
Thanks for this suggestion. We indeed tried to run the analysis in a whole-ROI approach, but this did not result in a significant effect in EC. Importantly, we disagree with the reviewer that this is a “better way to show specificity” than the searchlight approach. In our view, the two analyses differ with respect to the spatial extent of the representation they test for. The searchlight approach is testing for a highly localised representation on the scale of small spheres with only 100 voxels. The signal of such a localised representation is likely to be drowned in the noise in an analysis that includes thousands of voxels which mostly don’t show the effect - as would be the case in the whole-ROI approach.
(4) Subjects had more experience with the hexagonal and community structures before and during fMRI scanning. This is another confound, and possible reason why there was no generalization across stimulus sets for the community structure.
See our response to comment (2).
Reviewer #2 (Public review):
Summary:
Mark and colleagues test the hypothesis that entorhinal cortical representations may contain abstract structural information that facilitates generalization across structurally similar contexts. To do so, they use a method called "subspace generalization" designed to measure abstraction of representations across different settings. The authors validate the method using hippocampal place cells and entorhinal grid cells recorded in a spatial task, then perform simulations that support that it might be useful in aggregated responses such as those measured with fMRI. Then the method is applied to fMRI data that required participants to learn relationships between images in one of two structural motifs (hexagonal grids versus community structure). They show that the BOLD signal within an entorhinal ROI shows increased measures of subspace generalization across different tasks with the same hexagonal structure (as compared to tasks with different structures) but that there was no evidence for the complementary result (ie. increased generalization across tasks that share community structure, as compared to those with different structures). Taken together, this manuscript describes and validates a method for identifying fMRI representations that generalize across conditions and applies it to reveal entorhinal representations that emerge across specific shared structural conditions.
Strengths:
I found this paper interesting both in terms of its methods and its motivating questions. The question asked is novel and the methods employed are new - and I believe this is the first time that they have been applied to fMRI data. I also found the iterative validation of the methodology to be interesting and important - showing persuasively that the method could detect a target representation - even in the face of a random combination of tuning and with the addition of noise, both being major hurdles to investigating representations using fMRI.
We thank the reviewer for their kind comments and the clear summary of our paper.
Weaknesses:
In part because of the thorough validation procedures, the paper came across to me as a bit of a hybrid between a methods paper and an empirical one. However, I have some concerns, both on the methods development/validation side, and on the empirical application side, which I believe limit what one can take away from the studies performed.
We thank the reviewer for the comment. We agree that the paper comes across as a bit of a methods-empirical hybrid. We chose to do this because we believe (as the reviewer also points out) that there is value in both aspects of the paper.
Regarding the methods side, while I can appreciate that the authors show how the subspace generalization method "could" identify representations of theoretical interest, I felt like there was a noticeable lack of characterization of the specificity of the method. Based on the main equation in the results section of the paper, it seems like the primary measure used here would be sensitive to overall firing rates/voxel activations, variance within specific neurons/voxels, and overall levels of correlation among neurons/voxels. While I believe that reasonable pre-processing strategies could deal with the first two potential issues, the third seems a bit more problematic - as obligate correlations among neurons/voxels surely exist in the brain and persist across context boundaries that are not achieving any sort of generalization (for example neurons that receive common input, or voxels that share spatial noise). The comparative approach (ie. computing difference in the measure across different comparison conditions) helps to mitigate this concern to some degree - but not completely - since if one of the conditions pushes activity into strongly spatially correlated dimensions, as would be expected if univariate activations were responsive to the conditions, then you'd expect generalization (driven by shared univariate activation of many voxels) to be specific to that set of conditions.
We thank the reviewer for their comments. We would like to point out that we demean each voxel across all states/piles (3-picture sequences) in a given graph/task (what the reviewer is calling “a condition”). Hence there is no shared univariate activation of many voxels in response to a graph going into the computation, and no sensitivity to the overall firing rate/voxel activation. Our calculation captures the variance across states within a task (here a graph), over and above the univariate effect of graph activity. In addition, we spatially pre-whiten the data within each searchlight, meaning that voxels with high noise variance will be downweighted and noise correlations between voxels are removed prior to applying our method.
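The two preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our pipeline: the noise covariance is estimated from simulated residuals, whitening uses the inverse matrix square root of that covariance, and all names and array sizes are hypothetical.

```python
import numpy as np

def demean_states(patterns):
    """Remove each voxel's mean across states within one graph/task.
    patterns: (n_voxels, n_states) response estimates."""
    return patterns - patterns.mean(axis=1, keepdims=True)

def prewhiten(patterns, noise_cov):
    """Multiply by the inverse matrix square root of the noise covariance,
    so that noise becomes uncorrelated with unit variance across voxels.
    Returns (whitened_patterns, whitening_matrix)."""
    evals, evecs = np.linalg.eigh(noise_cov)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return whitener @ patterns, whitener

rng = np.random.default_rng(0)
n_voxels, n_states, n_timepoints = 20, 10, 2000

# Hypothetical spatially correlated noise residuals for one searchlight.
mixing = rng.standard_normal((n_voxels, n_voxels))
noise_resid = mixing @ rng.standard_normal((n_voxels, n_timepoints))
noise_cov = np.cov(noise_resid)

# Patterns with a large shared univariate offset added to every voxel.
patterns = rng.standard_normal((n_voxels, n_states)) + 5.0
demeaned = demean_states(patterns)
whitened, whitener = prewhiten(demeaned, noise_cov)
```

After demeaning, a graph-wide univariate activation no longer contributes to the voxel-by-voxel covariance. After whitening, correlations that are present in the noise (for example, shared spatial noise) are removed from the patterns that enter the generalization measure.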
A second issue in terms of the method is that there is no comparison to simpler available methods. For example, given the aims of the paper, and the introduction of the method, I would have expected the authors to take the Neuron-by-Neuron correlation matrices for two conditions of interest, and examine how similar they are to one another, for example by correlating their lower triangle elements. Presumably, this method would pick up on most of the same things - although it would notably avoid interpreting high overall correlations as "generalization" - and perhaps paint a clearer picture of exactly what aspects of correlation structure are shared. Would this method pick up on the same things shown here? Is there a reason to use one method over the other?
We thank the reviewer for this important and interesting point. We agree that calculating the correlation between the upper triangular elements of the covariance or correlation matrices picks up similar, but not identical, aspects of the data (see below the mathematical explanation that was added to the supplementary information). When we repeated the searchlight analysis and calculated the correlation between the upper triangular entries of the Pearson correlation matrices, we obtained an effect in the EC, though weaker than with our subspace generalization method (t=3.9, the effect did not survive multiple comparisons). Similar results were obtained with the correlation between the upper triangular elements of the covariance matrices (t=3.8, the effect did not survive multiple comparisons).
The difference between the two methods is twofold: 1) Our method is based on the covariance matrix and not the correlation matrix - i.e. a difference in normalisation. We realised that in the main text of the original paper we mistakenly wrote “correlation matrix” rather than “covariance matrix” (though our equations did correctly show the covariance matrix). We have corrected this mistake in the revised manuscript. 2) The weighting of the variance explained in the direction of each eigenvector is different between the methods, with some benefits of our method for identifying low-dimensional representations and for robustness to strong spatial correlations. We have added a section “Subspace Generalisation vs correlating the Neuron-by-Neuron correlation matrices” to the supplementary information with a mathematical explanation of these differences.
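The simpler alternative discussed here, correlating the off-diagonal upper-triangular entries of two voxel-by-voxel matrices, can be sketched as below. The function name and the `kind` switch are illustrative assumptions; the sketch only shows the normalisation difference (correlation versus covariance) and does not reproduce the authors' searchlight analysis.

```python
import numpy as np

def upper_triangle_similarity(B1, B2, kind="correlation"):
    """Correlate off-diagonal upper-triangular entries of two voxel-by-voxel matrices.

    B1, B2: (n_voxels, n_states) activation matrices from two tasks.
    kind: "correlation" (normalised) or "covariance" (unnormalised).
    """
    to_matrix = np.corrcoef if kind == "correlation" else np.cov
    M1, M2 = to_matrix(B1), to_matrix(B2)      # rows are treated as variables
    iu = np.triu_indices_from(M1, k=1)         # exclude the diagonal
    return np.corrcoef(M1[iu], M2[iu])[0, 1]

rng = np.random.default_rng(1)
B_a = rng.normal(size=(20, 10))
B_b = rng.normal(size=(20, 10))
same = upper_triangle_similarity(B_a, B_a)
different = upper_triangle_similarity(B_a, B_b)
scaled = upper_triangle_similarity(B_a, 2 * B_a, kind="covariance")
```

Unlike an eigenvector-based measure, this summary weights every pair of voxels equally, regardless of how much variance lies along each direction.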
Regarding the fMRI empirical results, I have several concerns, some of which relate to concerns with the method itself described above. First, the spatial correlation patterns in fMRI data tend to be broad and will differ across conditions depending on variability in univariate responses (ie. if a condition contains some trials that evoke large univariate activations and others that evoke small univariate activations in the region). Are the eigenvectors that are shared across conditions capturing spatial patterns in voxel activations? Or, related to another concern with the method, are they capturing changing correlations across the entire set of voxels going into the analysis? As you might expect if the dynamic range of activations in the region is larger in one condition than the other?
This is a searchlight analysis, therefore it captures the activity patterns within nearby voxels. Indeed, as we show in our simulation, areas with high activity and therefore high signal to noise will have better signal in our method as well. Note that this is true of most measures.
My second concern is, beyond the specificity of the results, they provide only modest evidence for the key claims in the paper. The authors show a statistically significant result in the Entorhinal Cortex in one out of two conditions that they hypothesized they would see it. However, the effect is not particularly large. There is currently no examination of what the actual eigenvectors that transfer are doing/look like/are representing, nor how the degree of subspace generalization in EC may relate to individual differences in behavior, making it hard to assess the functional role of the relationship. So, at the end of the day, while the methods developed are interesting and potentially useful, I found the contributions to our understanding of EC representations to be somewhat limited.
We agree with this point, yet believe that the results still shed light on EC functionality. Unfortunately, we could not find a correlation between behavioral measures and the fMRI effect.
Reviewer #3 (Public review):
Summary:
The article explores the brain's ability to generalize information, with a specific focus on the entorhinal cortex (EC) and its role in learning and representing structural regularities that define relationships between entities in networks. The research provides empirical support for the longstanding theoretical and computational neuroscience hypothesis that the EC is crucial for structure generalization. It demonstrates that EC codes can generalize across non-spatial tasks that share common structural regularities, regardless of the similarity of sensory stimuli and network size.
Strengths:
(1) Empirical Support: The study provides strong empirical evidence for the theoretical and computational neuroscience argument about the EC's role in structure generalization.
(2) Novel Approach: The research uses an innovative methodology and applies the same methods to three independent data sets, enhancing the robustness and reliability of the findings.
(3) Controlled Analysis: The results are robust against well-controlled data and/or permutations.
(4) Generalizability: By integrating data from different sources, the study offers a comprehensive understanding of the EC's role, strengthening the overall evidence supporting structural generalization across different task environments.
Weaknesses:
A potential criticism might arise from the fact that the authors applied innovative methods originally used in animal electrophysiology data (Samborska et al., 2022) to noisy fMRI signals. While this is a valid point, it is noteworthy that the authors provide robust simulations suggesting that the generalization properties in EC representations can be detected even in low-resolution, noisy data under biologically plausible assumptions. I believe this is actually an advantage of the study, as it demonstrates the extent to which we can explore how the brain generalizes structural knowledge across different task environments in humans using fMRI. This is crucial for addressing the brain's ability in non-spatial abstract tasks, which are difficult to test in animal models.
While focusing on the role of the EC, this study does not extensively address whether other brain areas known to contain grid cells, such as the mPFC and PCC, also exhibit generalizable properties. Additionally, it remains unclear whether the EC encodes unique properties that differ from those of other systems. As the authors noted in the discussion, I believe this is an important question for future research.
We thank the reviewer for their comments and agree that this is a very interesting question. We looked for effects in the mPFC but did not obtain results strong enough to report in the main manuscript; we do report a small effect in the supplementary material.
Recommendations for the authors:
Reviewer #1 (Recommendations for the authors):
(1) I wonder how important the PCA on B1 (voxel-by-state matrix from environment 1) and the computation of the AUC (from the projection on B2 [voxel-by-state matrix from environment 2]) is for the analysis to work. Would you not get the same result if you correlated the voxel-by-voxel correlation matrix based on B1 (C1) with the voxel-by-voxel correlation matrix based on B2 (C2)? I understand that you would not have the subspace-by-subspace resolution that comes from the individual eigenvectors, but would the AUC not strongly correlate with the correlation between C1 and C2?
We agree with the reviewer's comment; see our response to Reviewer #2's second issue above.
(2) There is a subtle difference between how the method is described for the neural recording and fMRI data. Line 695 states that principal components of the neuron x neuron intercorrelation matrix are computed, whereas line 888 implies that principal components of the data matrix B are computed. Of note, B is a voxel x pile rather than a pile x voxel matrix. Wouldn't this result in U being pile x pile rather than voxel x voxel?
The PCs are calculated on the neuron x neuron (or voxel x voxel) covariance matrix of the activation matrix. We’ve added the following clarification to the relevant part of the Methods:
“We calculated noise normalized GLM betas within each searchlight using the RSA toolbox. For each searchlight and each graph, we had a nVoxels (100) by nPiles (10) activation matrix (B) that describes the activation of a voxel as a result of a particular pile (three pictures’ sequence). We exploited the (voxel x voxel) covariance matrix of this matrix to quantify the manifold alignment within each searchlight.”
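The computation described in this Methods passage can be sketched as follows: take the eigenvectors of the voxel-by-voxel covariance of one graph's activation matrix, project the other graph's activation matrix onto them, and summarise the cumulative variance explained by its area under the curve. This is a minimal sketch under assumptions: the function name is hypothetical, and normalising by the total projected variance and approximating the AUC by the mean of the cumulative curve are illustrative choices that may differ from the published code.

```python
import numpy as np

def subspace_generalization(B1, B2):
    """Cumulative variance of B2 explained by the principal components of B1.

    B1, B2: (n_voxels, n_states) activation matrices from two tasks.
    Returns (auc, cumulative_curve).
    """
    B1 = B1 - B1.mean(axis=1, keepdims=True)   # demean each voxel across states
    B2 = B2 - B2.mean(axis=1, keepdims=True)
    cov1 = B1 @ B1.T / B1.shape[1]             # voxel-by-voxel covariance of task 1
    evals, evecs = np.linalg.eigh(cov1)
    U = evecs[:, np.argsort(evals)[::-1]]      # eigenvectors, largest variance first
    projected_var = ((U.T @ B2) ** 2).sum(axis=1)
    cumulative = np.cumsum(projected_var) / projected_var.sum()
    return cumulative.mean(), cumulative       # mean of the curve approximates the AUC

rng = np.random.default_rng(0)
B1 = rng.normal(size=(100, 10))                # 100 voxels, 10 piles, as in the text
B_unrelated = rng.normal(size=(100, 10))
auc_same, curve_same = subspace_generalization(B1, B1)
auc_unrelated, _ = subspace_generalization(B1, B_unrelated)
```

With 100 voxels and 10 piles, the demeaned covariance has rank at most 9, so when the two matrices share a subspace the cumulative curve saturates within the first few components and the AUC is close to 1.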
(3) It would be very helpful to the field if the authors would make the code and data publicly available. Please consider depositing the code for data analysis and simulations, as well as the preprocessed/extracted data for the key results (rat data/fMRI ROI data) into a publicly accessible repository.
The code is publicly available in git (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).
(4) Line 219: "Kolmogorov Simonov test" should be "Kolmogorov Smirnov test".
Thanks!
(5) Please put plots in Figure 3F on the same y-axis.
(6) Were large and small graphs of a given statistical structure learned on the same days, and if so, sequentially or simultaneously? This could be clarified.
The graphs are learned on the same day. We clarified this in the Methods section.
Reviewer #2 (Recommendations for the authors):
Perhaps the advantage of the method described here is that you could narrow things down to the specific eigenvector that is doing the heavy lifting in terms of generalization... and then you could look at that eigenvector to see what aspect of the covariance structure persists across conditions of interest. For example, is it just the highest eigenvalue eigenvector that is likely picking up on correlations across the entire neural population? Or is there something more specific going on? One could start to get at this by looking at Figures 1A and 1C - for example, the primary difference for within/between condition generalization in 1C seems to emerge with the first component, and not much changes after that, perhaps suggesting that in this case, the analysis may be picking up on something like the overall level of correlations within different conditions, rather than a more specific pattern of correlations.
Because of the nature of the analysis, the eigenvectors are ordered by their contribution to the variance, so the first eigenvector accounts for more variance than any other. We did not rigorously check whether the remaining variance is split equally among the remaining eigenvectors, but this does not seem to be the case.
Why is variance explained above zero for fraction EVs = 0 for Figure 1C (but not 1A)? Is there some plotting convention that I'm missing here?
There was a small bug in this plot and it was corrected - thank you very much!
The authors say:
"Interestingly, the difference in AUCs was also 190 significantly smaller than chance for place cells (Figure 1a, compare dotted and solid green 191 lines, p<0.05 using permutation tests, see statistics and further examples in supplementary 192 material Figure S2), consistent with recent models predicting hippocampal remapping that is 193 not fully random (Whittington et al. 2020)."
But my read of the Whittington model is that it would predict slight positive relationships here, rather than the observed negative ones, akin to what one would expect if hippocampal neurons reflect a nonlinear summation of a broad swath of entorhinal inputs.
Smaller differences than chance imply that the remapping of place cells is not completely random.
Figure 2:
I didn't see any description of where noise amplitude values came from - or any justification at all in that section. Clearly, the amount of noise will be critical for putting limits on what can and cannot be detected with the method - I think this is worthy of characterization and explanation. In general, more information about the simulations is necessary to understand what was done in the pseudovoxel simulations. I get the gist of what was done, but these methods should be clear enough that someone could repeat them, and they currently are not.
Thanks, we added noise amplitude to the figure legend and Methods.
What does flexible mean in the title? The analysis only worked for the hexagonal grid - doesn't that suggest that whatever representations are uncovered here are not flexible in the sense of being able to encode different things?
Flexible here means flexible over characteristics that are unrelated to the structural form, such as the stimuli themselves and the size of the graph.
Reviewer #3 (Recommendations for the authors):
I have noticed that the authors have updated the previous preprint version to include extensive simulations. I believe this addition helps address potential criticisms regarding the signal-to-noise ratio. If the authors could share the code for the fMRI data and the simulations in an open repository, it would enhance the study's impact by reaching a broader readership across various research fields. Except for that, I have nothing to ask for revision.
Thanks, the code will be publicly available: (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).
eLife Assessment
This important study advances our understanding of population-level immune responses to influenza in both children and adults. The strength of the evidence supporting the conclusions is compelling, with high-throughput profiling assays and mathematical modeling. The work will be of interest to immunologists, virologists, vaccine developers, and those working on mathematical modeling of infectious diseases.
Reviewer #1 (Public review):
The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.
Thanks to the authors for the revised version of the manuscript. A few concerns remain after the revision:
(1) We appreciate the additional computational analysis the authors have performed on normalizing the titers with the geometric mean titer for each individual, as shown in the new Supplemental Figure 6. We agree with the authors' statement that, after averaging again within specific age groups, "there are no obvious age group-specific patterns." A discussion of this should be added to the revised manuscript, for example in the section "Pooled sera fail to capture the heterogeneity of individual sera," referring to the new Supplemental Figure 6.
However, we also suggested that after this normalization, patterns might emerge that are not necessarily defined by birth cohort. This possibility remains unexplored and could provide an interesting addition to support potential effects of substitutions at sites 145 and 275/276 in individuals with specific titer profiles, which as stated above do not necessarily follow birth cohort patterns.
(2) Thank you for elaborating further on the method used to estimate growth rates in your reply to the reviewers. To clarify: the reason that we infer from Fig. 5a that A/Massachusetts has a higher fitness than A/Sydney is not because it reaches a higher maximum frequency, but because it seems to have a higher slope. The discrepancy between this plot and the MLR inferred fitness could be clarified by plotting the frequency trajectories on a log-scale.
For the MLR, we understand that the initial frequency matters in assessing a variant's growth. However, when starting points of two clades differ in time (i.e., in different contexts of competing clades), this affects comparability, particularly between A/Massachusetts and A/Ontario, as well as for other strains. We still think that mentioning these time-dependent effects, which are not captured by the MLR analysis, would be appropriate. To support this, it could be helpful to include the MLR fits as an appendix figure, showing the different starting and/or time points used.
(3) Regarding my previous suggestion to test an older vaccine strain than A/Texas/50/2012 to assess whether the observed peak in titer measurements is virus-specific: We understand that the authors want to focus the scope of this paper on the relative fitness of contemporary strains, and that this additional experimental effort would go beyond the main objectives outlined in this manuscript. However, the authors explicitly note that "Adults across age groups also have their highest titers to the oldest vaccine strain tested, consistent with the fact that these adults were first imprinted by exposure to an older strain." This statement gives the impression that imprinting effects increase titers for older strains, whereas this does not seem to be true from their results, but only true for A/Texas. It should be modified accordingly.
Reviewer #2 (Public review):
This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, that will be relevant across pathogens (assuming the assay can be appropriately adapted). I only had a few comments, focused on maximising the information provided by the sera. These concerns were all addressed in the revised paper.
Reviewer #3 (Public review):
The authors use high throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. The updated manuscript has a stronger motivation, and there is substantial potential to build on this work in future research.
Comments on revisions:
I have no additional recommendations. There are several areas where the work could be further developed, which were not addressed in detail in the responses, but given this is a strong manuscript as it stands, it is fine that these aspects are for consideration only at this point.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.
Thanks for this nice summary of our paper.
The following points could be addressed in a revision:
(1) The authors conclude that much of the person-to-person and strain-to-strain variation seems idiosyncratic to individual sera rather than age groups. This point is not yet fully convincing. While the mean titer of an individual may be idiosyncratic to the individual sera, the strain-to-strain variation still reveals some patterns that are consistent across individuals (the authors note the effects of substitutions at sites 145 and 275/276). A more detailed analysis, removing the individual-specific mean titer, could still show shared patterns in groups of individuals that are not necessarily defined by the birth cohort.
As the reviewer suggests, we normalized the titers for all sera to the geometric mean titer for each individual in the US-based pre-vaccination adults and children, restricting the analysis to the 2023-circulating viral strains. We then faceted these normalized titers by the same age groups we used in Figure 6, and the resulting plot is shown. Although there are differences among virus strains (some are better neutralized than others), there are no obvious age group-specific patterns (e.g., the trends in the two facets are similar). This observation suggests that, at least for these relatively closely related recent H3N2 strains, the strain-to-strain variation does not obviously segregate by age group. It is of course possible (we think likely) that there would be more obvious age group-specific trends if we looked at a larger swath of viral strains covering a longer time range (e.g., over decades of influenza evolution). We have added the new plots as Supplemental Figure 6 in the revised manuscript.
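The normalization described in this response, dividing each serum's titers by that individual's geometric mean titer, can be illustrated with a short sketch. The function name, data layout, and the serum and strain labels are hypothetical and chosen only for illustration.

```python
import math

def normalize_to_geometric_mean(titers):
    """Divide each serum's titers by that serum's geometric mean titer.

    titers: dict mapping serum id -> dict mapping strain -> titer (all > 0).
    """
    normalized = {}
    for serum, by_strain in titers.items():
        log_mean = sum(math.log(t) for t in by_strain.values()) / len(by_strain)
        gmt = math.exp(log_mean)               # geometric mean titer of this serum
        normalized[serum] = {strain: t / gmt for strain, t in by_strain.items()}
    return normalized

# Toy example: serum_2 has 10-fold higher titers but the same strain-to-strain profile.
titers = {
    "serum_1": {"strain_A": 40.0, "strain_B": 160.0},
    "serum_2": {"strain_A": 400.0, "strain_B": 1600.0},
}
normalized = normalize_to_geometric_mean(titers)
```

After normalization the two toy sera have identical profiles, so any remaining differences between individuals reflect strain-to-strain variation rather than overall titer magnitude.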
(2) The authors show that the fraction of sera with a titer below 138 correlates strongly with the inferred growth rate using MLR. However, the authors also note that there exists a strong correlation between the MLR growth rate and the number of HA1 mutations. This analysis does not yet show that the titers provide substantially more information about the evolutionary success. The actual relation between the measured titers and fitness is certainly more subtle than suggested by the correlation plot in Figure 5. For example, the clades A/Massachusetts and A/Sydney both have a positive fitness at the beginning of 2023, but A/Massachusetts has substantially higher relative fitness than A/Sydney. The growth inference in Figure 5b does not appear to map that difference, and the antigenic data would give the opposite ranking. Similarly, the clades A/Massachusetts and A/Ontario both have positive relative fitness, as correctly identified by the antigenic ranking, but at quite different times (i.e., in different contexts of competing clades). Other clades, like A/St. Petersburg, are assigned high growth and high escape but remain at low frequency throughout. Some mention of these effects not mapped by the analysis may be appropriate.
Thanks for the nice summary of our findings in Figure 5. However, the reviewer is misreading the growth charts when they say that A/Massachusetts/18/2022 has a substantially higher fitness than A/Sydney/332/2023. Figure 5a (reprinted at left panel) shows the frequency trajectory of different variants over time. While A/Massachusetts/18/2022 reaches a higher frequency than A/Sydney/332/2023, the trajectory is similar and the reason that A/Massachusetts/18/2022 reached a higher max frequency is that it started at a higher frequency at the beginning of 2023. The MLR growth rate estimates differ from the maximum absolute frequency reached: instead, they reflect how rapidly each strain grows relative to others. In fact, A/Massachusetts/18/2022 and A/Sydney/332/2023 have similar growth rates, as shown in Supplemental Figure 6b (reprinted at right). Similarly, A/Saint-Petersburg/RII-166/2023 starts at a low initial frequency but then grows even as A/Massachusetts/18/2022 and A/Sydney/332/2023 are declining, and so has a higher growth rate than both of those.
In the revised manuscript, we have clarified how viral growth rates are estimated from frequency trajectories, and how growth rate differs from max frequency in the text below:
“To estimate the evolutionary success of different human H3N2 influenza strains during 2023, we used multinomial logistic regression, which analyzes strain frequencies over time to calculate strain-specific relative growth rates [51–53]. There were sufficient sequencing counts to reliably estimate growth rates in 2023 for 12 of the HAs for which we measured titers using our sequencing-based neutralization assay libraries (Figure 5a,b and Supplemental Figure 9a,b). Note that these growth rates estimate how rapidly each strain grows relative to the other strains, rather than the absolute highest frequency reached by each strain.”
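The distinction between relative growth rate and peak frequency can be illustrated with a toy calculation. The sketch below is a deliberate simplification and not the likelihood-based multinomial logistic regression the authors used: it assumes noise-free frequencies and estimates each strain's growth rate as the least-squares slope of the log frequency ratio to a reference strain. The strain labels, starting frequencies, and rates are hypothetical.

```python
import math

def simulate(times, init, rates):
    """Deterministic frequencies with f_i(t) proportional to init_i * exp(rate_i * t)."""
    out = {name: [] for name in init}
    for t in times:
        weights = {name: init[name] * math.exp(rates[name] * t) for name in init}
        total = sum(weights.values())
        for name in init:
            out[name].append(weights[name] / total)
    return out

def relative_growth_rates(times, freqs, reference):
    """Slope of log(f_strain / f_reference) over time, by ordinary least squares."""
    def slope(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return sxy / sxx
    ref = freqs[reference]
    return {
        name: slope(times, [math.log(f / r) for f, r in zip(series, ref)])
        for name, series in freqs.items()
    }

times = list(range(10))
init = {"A": 0.60, "B": 0.39, "C": 0.01}       # strain C starts rare
rates = {"A": 0.0, "B": 0.0, "C": 0.3}         # but grows fastest
freqs = simulate(times, init, rates)
growth = relative_growth_rates(times, freqs, reference="A")
```

In this toy example strain C has the highest growth rate yet never reaches the frequency at which strains A and B started, mirroring the point that a strain's maximum frequency depends on its starting frequency as well as its growth rate.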
(3) For the protection profile against the vaccine strains, the authors find for the adult cohort that the highest titer is always against the oldest vaccine strain tested, which is A/Texas/50/2012. However, the adult sera do not show an increase in titer towards older strains, but only a peak at A/Texas. Therefore, it could be that this is a virus-specific effect, rather than a property of the protection profile. Could the authors test with one older vaccine virus (A/Perth/16/2009?) whether this really can be a general property?
We are interested in studying immune imprinting more thoroughly using sequencing-based neutralization assays, but we note that the adults in the cohorts we studied would have been imprinted with much older strains than included in this library. As this paper focuses on the relative fitness of contemporary strains with minor secondary points regarding imprinting, these experiments are beyond the scope of this study. We’re excited for future work (from our group or others) to explore these points by making a new virus library with strains from multiple decades of influenza evolution.
Reviewer #2 (Public review):
This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, which will be relevant across pathogens (assuming the assay can be appropriately adapted). I only have a few comments, focused on maximising the information provided by the sera.
Thanks very much!
Firstly, one of the major findings is that there is wide heterogeneity in responses across individuals. However, we could expect that individuals' responses should be at least correlated across the viruses considered, especially when individuals are of a similar age. It would be interesting to quantify the correlation in responses as a function of the difference in ages between pairs of individuals. I am also left wondering what the potential drivers of the differences in responses are, with age being presumably key. It would be interesting to explore individual factors associated with responses to specific viruses (beyond simply comparing adults versus children).
We thank the reviewer for this interesting idea. We performed this analysis (and the related analyses described) and added this as a new Supplemental Figure 7, which is pasted after the response to the next related comment by the reviewer.
For 2023-circulating strains, we observed basically no correlation between the strength of correlation between pairs of sera and the difference in age between those pairs of sera (Supplemental Figure 7), which was unsurprising given the high degree of heterogeneity between individual sera (Figure 3, Supplemental Figure 6, and Supplemental Figure 8). For vaccine strains, there is a moderate negative correlation only in the children, but not in the adults or the combined group of adults and children. This could be because the children are younger with limited and potentially more similar vaccine and exposure histories than the adults. It could also be because the children are overall closer in age than the adults.
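One way to set up the pairwise analysis described here is sketched below: compute the correlation of titer profiles for every pair of sera, then relate that similarity to the age difference within each pair. The use of Pearson correlation on log titers is an assumption, as the response does not specify the statistic, and the function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity_vs_age_gap(log_titers, ages):
    """For every pair of sera, return (age difference, correlation of titer profiles)."""
    pairs = []
    for a, b in combinations(sorted(log_titers), 2):
        pairs.append((abs(ages[a] - ages[b]), pearson(log_titers[a], log_titers[b])))
    return pairs

# Toy data: s1 and s2 are close in age with matching profiles; s3 is older and differs.
log_titers = {"s1": [1.0, 2.0, 3.0], "s2": [2.0, 4.0, 6.0], "s3": [3.0, 2.0, 1.0]}
ages = {"s1": 5, "s2": 6, "s3": 40}
pairs = similarity_vs_age_gap(log_titers, ages)
gaps, sims = zip(*pairs)
trend = pearson(gaps, sims)    # negative when similarity falls with age difference
```

A value of `trend` near zero would correspond to the absence of a relationship reported for the 2023-circulating strains, and a negative value to the pattern reported for vaccine strains in children.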
Relatedly, is the phylogenetic distance between pairs of viruses associated with similarity in responses?
For 2023-circulating strains, across sera cohorts we observed a weak-to-moderate correlation between the strength of correlation between the neutralizing titers across all sera to pairs of viruses and the Hamming distances between virus pairs. For the same comparison with vaccine strains, we observed moderate correlations, but this must be caveated with the slightly larger range of Hamming distances between vaccine strains. Notably, many of the points on the negative correlation slope are a mix of egg- and cell-produced vaccine strains from similar years, but there are some strain comparisons where the same year’s egg- and cell-produced vaccine strains correlate poorly.
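The Hamming distance used in this comparison is simply the number of positions at which two aligned sequences differ. A minimal sketch follows; the short sequences and virus labels are made up for illustration and are not real HA sequences.

```python
from itertools import combinations

def hamming(seq_a, seq_b):
    """Number of differing positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def pairwise_hamming(sequences):
    """Hamming distance for every unordered pair of named sequences."""
    return {
        (a, b): hamming(sequences[a], sequences[b])
        for a, b in combinations(sorted(sequences), 2)
    }

# Hypothetical aligned amino acid fragments.
sequences = {
    "virus_1": "QKLPGNDNST",
    "virus_2": "QKLPGSDNST",
    "virus_3": "QKIPGSDKST",
}
distances = pairwise_hamming(sequences)
```

Each pairwise distance can then be paired with the correlation of titers against the same two viruses across all sera to examine the relationship described in the response.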
Figure 5C is also a really interesting result. To be able to predict growth rates based on titers in the sera is fascinating. As touched upon in the discussion, I suspect it is really dependent on the representativeness of the sera of the population (so, e.g., if only elderly individuals provided sera, it would be a different result than if only children provided samples). It may be interesting to compare different hypotheses - so e.g., see if a population-weighted titer is even better correlated with fitness - so the contribution from each individual's titer is linked to a number of individuals of that age in the population. Alternatively, maybe only the titers in younger individuals are most relevant to fitness, etc.
We’re very interested in these analyses, but suggest they may be better explored in subsequent works that could sample more children, teenagers and adults across age groups. Our sera set, as the reviewer suggests, may be under-powered to perform the proposed analysis on subsetted age groups of our larger age cohorts.
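For readers who wish to explore the reviewer's suggestion, a population-weighted version of the fraction of sera below a titer cutoff could be sketched as follows. This is a hypothetical illustration, not an analysis from the paper: the age groups, population shares, and titers are invented, and the default cutoff of 138 is taken from Reviewer #1's comment, with its use as an upper bound being an assumption.

```python
def fraction_below(titers, cutoff=138.0):
    """Fraction of sera whose titer is below the cutoff."""
    return sum(t < cutoff for t in titers) / len(titers)

def weighted_fraction_below(titers, groups, population_share, cutoff=138.0):
    """Weight each age group's fraction by that group's share of the population.

    Every group in population_share must contain at least one serum.
    """
    total = 0.0
    for group, share in population_share.items():
        group_titers = [t for t, g in zip(titers, groups) if g == group]
        if not group_titers:
            raise ValueError(f"no sera in group {group!r}")
        total += share * fraction_below(group_titers, cutoff)
    return total

# Toy data: children are over-represented among the sera relative to the population.
titers = [100.0, 120.0, 50.0, 500.0]
groups = ["child", "child", "adult", "adult"]
population_share = {"child": 0.2, "adult": 0.8}

unweighted = fraction_below(titers)
weighted = weighted_fraction_below(titers, groups, population_share)
```

In the toy data the unweighted fraction is 0.75 while the weighted fraction is 0.6, showing how the summary shifts when the sera are not sampled in proportion to the population.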
In Figure 6, the authors lump together individuals within 10-year age categories - however, this is potentially throwing away the nuances of what is happening at individual ages, especially for the children, where the measured viruses cross different groups. I realise the numbers are small and the viruses only come from a small number of years; however, it may be preferable to order all the individuals by age (y-axis) and the viral responses in ascending order (x-axis) and plot the response as a heatmap. As currently plotted, it is difficult to compare across panels.
This is a good suggestion. In the revised manuscript we have included a heatmap of the children and pre-vaccination adults, ordered by the year of birth of each individual, as Supplemental figure 8. That new figure is also pasted in this response.
Reviewer #3 (Public review):
The authors use high-throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. However, there are some areas where I thought the work could be more strongly motivated and linked together. In particular, how the vaccine responses in US and Australia in Figures 6-7 relate to the earlier analysis around growth rates, and what we would expect the relationship between growth rate and population immunity to be based on epidemic theory.
Thank you for this nice summary. The reviewer also notes that the text related to Figures 6 and 7 is secondary to the main story presented in Figures 3-5. The main motivation for including Figures 6 and 7 was to demonstrate the wide-ranging applications of sequencing-based neutralization data. We have tried to clarify this with the following minor text revisions, which do not add new content but which we hope smooth the transition between results sections.
While the preceding analyses demonstrated the utility of sequencing-based neutralization assays for measuring titers of currently circulating strains, our library also included viruses with HAs from each of the H3N2 influenza Northern Hemisphere vaccine strains from the last decade (2014 to 2024, see Supplemental Table 1). These historical vaccine strains cover a much wider span of evolutionary diversity than the 2023-circulating strains analyzed in the preceding sections (Figure 2a,b and Supplemental Figure 2b-e). For this analysis, we focused on the cell-passaged strains for each vaccine, as these are more antigenically similar to their contemporary circulating strains than the egg-passaged vaccine strains since they lack the mutations that arise during growth of viruses in eggs [55–57] (Supplemental Table 1).
Our sequencing-based assay could also be used to assess the impact of vaccination on neutralization titers against the full set of strains in our H3N2 library. To do this, we analyzed matched 28-day post-vaccination samples for each of the above-described 39 pre-vaccination samples from the cohort of adults based in the USA (Table 1). We also analyzed a smaller set of matched pre- and post-vaccination sera samples from a cohort of eight adults based in Australia (Table 1). Note that there are several differences between these cohorts: the USA-based cohort received the 2023-2024 Northern Hemisphere egg-grown vaccine whereas the Australia-based cohort received the 2024 Southern Hemisphere cell-grown vaccine, and most individuals in the USA-based cohort had also been vaccinated in the prior season whereas most individuals in the Australia-based cohort had not. Therefore, multiple factors could contribute to observed differences in vaccine response between the cohorts.
Reviewer #3 (Recommendations for the authors):
Main comments:
(1) The authors compare titres of the pooled sera with the median titres across individual sera, finding a weak correlation (Figure 4). I was therefore interested in the finding that geometric mean titre and median across a study population are well correlated with growth rate (Supplemental Figure 6c). It would be useful to have some more discussion on why estimates from a pool are so much worse than pooled estimates.
We thank this reviewer for this point. We would clarify that pooling sera is the equivalent of taking the arithmetic mean of the individual sera, rather than the geometric mean or median, which tends to bias the measurements of the pool to the outliers within the pool. To address this reviewer’s point, we’ve added the following text to the manuscript:
“To confirm that sera pools are not reflective of the full heterogeneity of their constituent sera, we created equal volume pools of the children and adult sera and measured the titers of these pools using the sequencing-based neutralization assay. As expected, neutralization titers of the pooled sera were always higher than the median across the individual constituent sera, and the pool titers against different viral strains were only modestly correlated with the median titers across individual sera (Figure 4). The differences in titers across strains were also compressed in the serum pools relative to the median across individual sera (Figure 4). The failure of the serum pools to capture the median titers of all the individual sera is especially dramatic for the children sera (Figure 4) because these sera are so heterogeneous in their individual titers (Figure 3b). Taken together, these results show that serum pools do not fully represent individual-level heterogeneity, and are similar to taking the arithmetic mean of the titers for a pool of individuals, which tends to be biased by the highest titer sera”.
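A brief numerical sketch may help illustrate why a pool tends to track the highest-titer sera. The titers below are hypothetical values chosen purely for illustration (they are not data from the study), and the equal-volume pool is approximated as the arithmetic mean of the individual titers, which is a simplification of the actual assay.

```python
import statistics

# Hypothetical neutralization titers for five individual sera (illustration only).
# One serum has a much higher titer than the rest.
titers = [20, 40, 40, 80, 2560]

# Rough proxy for an equal-volume pool: the arithmetic mean of individual titers.
arithmetic_mean = statistics.mean(titers)

# Summary statistics computed across individuals.
geometric_mean = statistics.geometric_mean(titers)
median = statistics.median(titers)

print(f"arithmetic mean (pool proxy): {arithmetic_mean}")
print(f"geometric mean:               {geometric_mean:.1f}")
print(f"median:                       {median}")
```

In this toy case the single high-titer serum pulls the arithmetic mean far above both the geometric mean and the median, which is the direction of bias described in the response.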
(2) Perhaps I missed it, but are growth rates weekly growth rates? (I assume so?)
The growth rates are relative exponential growth rates calculated assuming a serial interval of 3.6 days. We also added clarifying language and a citation for the serial interval to the methods section:
The analysis performing H3 HA strain growth rate estimates using the evofr[51] package is at https://github.com/jbloomlab/flu_H3_2023_seqneut_vs_growth. Briefly, we sought to make growth rate estimates for the strains in 2023 since this was the same timeframe when the sera were collected. To achieve this, we downloaded all publicly-available H3N2 sequences from the GISAID[88] EpiFlu database, filtering to only those sequences that closely matched a library HA1 sequence (within one HA1 amino-acid mutation) and were collected between January 2023 and December 2023. If a sequence was within one HA1 amino-acid mutation of multiple library HA1 proteins then it was assigned to the closest one; if there were multiple equally close matches then it was assigned fractionally to each match. We only made growth rate estimates for library strains with at least 80 sequencing counts (Supplemental Figure 9a), and ignored counts for sequences that did not match a library strain (equivalent results were obtained if we instead fit a growth rate for these sequences as an “other” category). We then fit multinomial logistic regression models using the evofr[51] package assuming a serial interval of 3.6 days[101] to the strain counts. For the plot in Figure 5a the frequencies are averaged over a 14-day sliding window for visual clarity, but the fits were to the raw sequencing counts. For most of the analyses in this paper we used models based on requiring 80 sequencing counts to make an estimate for strain growth rates, and counting a sequence as a match if it was within one amino-acid mutation; see https://jbloomlab.github.io/flu_H3_2023_seqneut_vs_growth/ for comparable analyses using different reasonable sequence count cutoffs (e.g., 60, 50, 40 and 30, as depicted in Supplemental Figure 9). Across sequence cutoffs, we found that the fraction of individuals with low neutralization titers and number of HA1 mutations correlated strongly with these MLR-estimated strain growth rates.
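The sequence-to-strain assignment rule described above (match within one amino-acid mutation, assign to the closest strain, split ties fractionally) can be sketched as follows. This is a minimal illustration and not the code in the linked repository: the function names, the toy four-residue "HA1" strings, and the assumption that sequences are pre-aligned and of equal length are all hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count differing positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_counts(sequences, library, max_dist=1):
    """Assign each sequence to its closest library strain(s).

    Sequences further than max_dist mutations from every library strain are
    ignored. Ties between equally close strains are split fractionally.
    """
    counts = defaultdict(float)
    for seq in sequences:
        dists = {name: hamming(seq, ref) for name, ref in library.items()}
        best = min(dists.values())
        if best > max_dist:
            continue  # no library strain is close enough; skip this sequence
        winners = [name for name, d in dists.items() if d == best]
        for name in winners:
            counts[name] += 1 / len(winners)
    return dict(counts)


# Toy library and observed sequences (hypothetical, for illustration only).
library = {"A": "KTNS", "B": "KTNG", "C": "RTDS"}
sequences = ["KTNS", "KTNA", "RTDS", "QQQQ"]

counts = assign_counts(sequences, library)
print(counts)
```

Here "KTNA" is one mutation from both A and B, so it contributes half a count to each, while "QQQQ" matches no library strain and is dropped. A count threshold (such as the 80-count cutoff mentioned above) would then be applied to the resulting totals.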
(3) I found Figure 3 useful in that it presents phylogenetic structure alongside titres, to make it clearer why certain clusters of strains have a lower response. In contrast, I found it harder to meaningfully interpret Figure 7a beyond the conclusion that vaccines lead to a fairly uniform rise in titre. Do the 275 or 276 mutations that seem important for adults in Figure 3 have any impact?
We are certainly interested in the questions this reviewer raises, and in trying to understand how well a seasonal vaccine protects against the most successful influenza variants that season. However, these post-vaccination sera were taken when neutralizing titers peak ~30 days after vaccination. Because of this, in the larger cohort of US-based post-vaccination adults, the median titers across sera to most strains appear uniformly high. In the Australian-based post-vaccination adults, there was some strain-to-strain variation in median titers across sera, but of course this must be caveated with the much smaller sample size. It might be more relevant to answer this question with longitudinally sampled sera, when titers begin to wane in the following months.
(4) It could be useful to define a mechanistic relationship about how you would expect susceptibility (e.g. fraction with titre < X, where X is a good correlate) to relate to growth via the reproduction number: R = R0 x S. For example, under the assumption the generation interval G is the same for all, we have R = exp(r*G), which would make it possible to make a prediction about how much we would expect the growth rate to change between S = 0.45 and 0.6, as in Fig 5c. This sort of brief calculation (or at least some discussion) could add some more theoretical underpinning to the analysis, and help others build on the work in settings with different fractions with low titres. It would also provide some intuition into whether we would expect relationships to be linear.
This is an interesting idea for future work! However, the scope of our current study is to provide these experimental data and show a correlation with growth; we hope this can be used to build more mechanistic models in future.
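For readers who want to see the arithmetic behind comment (4), the reviewer's relationships R = R0 x S and R = exp(r*G) imply r = ln(R0 x S) / G, so the difference in growth rate between two susceptible fractions does not depend on R0. The sketch below is only an illustration under stated assumptions: it treats the 3.6-day serial interval as the generation interval G, uses a hypothetical R0, and treats the fraction with low titers as a direct proxy for S. It is not an analysis performed in the study.

```python
import math


def growth_rate(r0, s, g):
    """Exponential growth rate r implied by R = r0 * s and R = exp(r * g)."""
    return math.log(r0 * s) / g


G = 3.6   # days; assumes the generation interval equals the serial interval
R0 = 2.0  # hypothetical basic reproduction number (cancels in the difference)

# Difference in growth rate between S = 0.60 and S = 0.45, as in the comment.
delta_r = growth_rate(R0, 0.60, G) - growth_rate(R0, 0.45, G)

print(f"difference in growth rate: {delta_r:.4f} per day")
```

Under these assumptions the difference reduces to ln(0.60 / 0.45) / G, which is roughly 0.08 per day. Because r depends on the logarithm of S, this simple model would predict a curved rather than strictly linear relationship between growth rate and the susceptible fraction.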
(5) A key conclusion from the analysis is that the fraction above a threshold of ~140 is particularly informative for growth rate prediction, so would it be worth including this in Figure 6-7 to give a clearer indication of how much vaccination reduces contribution to strain growth among those who are vaccinated? This could also help link these figures more clearly with the main analysis and question.
Although our data do find ~140 to be the threshold that gives max correlation with growth rate, we are not comfortable strongly concluding 140 is a correlate of protection, as titers could influence viral fitness without completely protecting against infection. In addition, inspection of Figure 5d shows that while ~140 does give the maximal correlation, a good correlation is observed for most cutoffs in the range from ~40 to 200, so we are not sure how robustly ~140 can be identified as the optimal threshold.
(6) In Figure 5, the caption doesn't seem to include a description for (e).
Thank you to the reviewer for catching this – this is fixed now.
(7) The US vs Australia comparison could have benefited from more motivation. The authors conclude, "Due to the multiple differences between cohorts we are unable to confidently ascribe a cause to these differences in magnitude of vaccine response" - given the small sample sizes, what hypotheses could have been tested with these data? The comparison isn't covered in the Discussion, so it seems a bit tangential currently.
Thank you to the reviewer for this comment, but we should clarify our aim was not to directly compare US and Australian adults. We are interested in regional comparisons between serum cohorts, but did not have the numbers to adequately address those questions here. This section (and the preceding question) were indeed both intended to be tangential to the main finding, and hopefully this will be clarified with our text additions in response to Reviewer #3’s public reviews.
eLife Assessment
This is a useful study that examines the relationship between neuropeptide signaling and the precision of vocal motor output using the songbird as a model system. The study presents evidence based on differential expression patterns and genetic or pharmacological inhibition of various neuropeptide genes for a causal role in song performance; however, this evidence is incomplete.
Reviewer #1 (Public review):
Summary:
This study provides evidence that neuropeptide signaling, particularly via the CRH-CRHBP pathway, plays a key role in regulating the precision of vocal motor output in songbirds. By integrating gene expression profiling with targeted manipulations in the song vocal motor nucleus RA, the authors demonstrate that altering CRH and CRHBP levels bidirectionally modulates song variability. These findings reveal a previously unrecognized neuropeptidergic mechanism underlying motor performance control, supported by molecular and functional evidence.
Strengths:
Neural circuit mechanisms underlying motor variability have been intensively studied, yet the molecular bases of such variability remain poorly understood. The authors address this important gap using the songbird (Bengalese finch) as a model system for motor learning, providing experimental evidence that neuropeptide signaling contributes to vocal motor variability. They comprehensively characterize the expression patterns of neuropeptide-related genes in brain regions involved in song vocal learning and production, revealing distinct regulatory profiles compared to non-vocal regions, as well as developmental and behavioral dependencies, including altered expression following deafening and correlations with singing activity over the two days preceding sampling. Through these multi-level analyses spanning anatomy, development, and behavior, the authors identify the CRH-CRHBP pathway in the vocal motor nucleus RA as a candidate regulator of song variability. Functional manipulations further demonstrate that modulation of this pathway bidirectionally alters song variability.
Overall, this work represents an effective use of songbirds: through a well-established neuroethological framework, it uncovers how previously uncharacterized molecular pathways shape behavioral output at the individual level.
Weaknesses:
(1) This study uses Bengalese finches (BFs) for all experiments (bulk RNA-seq, in situ hybridization across developmental stages, deafening, gene manipulation, and CRH microinfusion) except for the sc/snRNA-seq analysis. BFs differ from zebra finches (ZFs) in several important ways, including faster song degradation after deafening and greater syllable sequence complexity. This study makes effective use of these unique BF characteristics and should be commended for doing so.
However, the major concern lies in the use of the single-cell/single-nucleus RNA-seq dataset from Colquitt et al. (2021), which combines data from both ZFs and BFs for cell-type classification. Based on our reanalysis of the publicly available dataset used in both Colquitt et al. (2021) and the present study, my lab identified two major issues:
(a) The first concern is that the quality of the single-cell RNA-seq data from BFs is extremely poor, and the number of BF-derived cells is very limited. In other words, most of the gene expression information at the single-cell (or "subcellular type") level in this study likely reflects ZF rather than BF profiles. In our verification of the authors' publicly annotated data, we found that in the song nucleus RA, only about 18 glutamatergic cells (2.3%) of a total of 787 RA_Glut (RA_Glut1+2+3) cells were derived from BFs. Similarly, in HVC, only 53 cells (4.1%) out of 1,278 Glut1+Glut4 cells were BF-derived. This clearly indicates that the cell-subtype-level expression data discussed in this study are predominantly based on ZF, not BF, expression profiles.
Recent studies have begun to report interspecies differences in the expression of many genes in the song control nuclei. It is therefore highly plausible that the expression patterns of CRHBP and other neuropeptide-signaling-related genes differ between ZFs and BFs. Yet, the current study does not appear to take this potential species difference into account. As a result, analyses such as the CellChat results (Fig. 2F and G) and the model proposed in Fig. 6G are based on ZF-derived transcriptomic information, even though the rest of the experimental data are derived from BF, which raises a critical methodological inconsistency.
(b) The second major concern involves the definition of "subcellular types" in the sc/snRNA-seq dataset. Specifically, the RA_Glut1, 2, and 3 and HVC_Glu1 and 4 clusters, classified as glutamatergic projection neuron subtypes, may in fact represent inter-individual variation within the same cell type rather than true subtypes. Following Colquitt et al. (2021), Toji et al. (PNAS, 2024) demonstrated clear individual differences in the gene expression profiles of glutamatergic projection neurons in RA.
In our reanalysis of the same dataset, we also observed multiple clusters representing the same glutamatergic projection neurons in UMAP space. This likely occurs because Seurat integration (anchor-based mutual nearest neighbor integration) was not applied, and because cells were not classified based on individual SNP information using tools such as Souporcell. When classified by individual SNPs, we confirmed that the RA_Glut1-3 and HVC_Glu1 and 4 clusters correspond simply to cells from different individuals rather than distinct subcellular types. (Although images cannot be attached in this review system, we can provide our analysis results if necessary.)
This distinction is crucial, as subsequent analyses and interpretations throughout the manuscript depend on this classification. In particular, Figure 6G presents a model based on this questionable subcellular classification. Similarly, the ligand-receptor relationships shown in Figure 2G, such as the absence of SST-SSTR1 signaling in RA_Glut3 but its presence in RA_Glut1 and 2, are more plausibly explained by inter-individual variation than by subcellular-type specificity.
Whether these differences are interpreted as individual variation within a single cell type or as differences in projection targets among glutamatergic neurons has major implications for understanding the biological meaning of neuropeptide-related gene expression in this system.
(2) Based on the important finding that "CRHBP expression in the song motor pathway is correlated with singing," it is necessary to provide data showing that the observed changes in CRHBP and other neuropeptide-related gene expression during the song learning period or after deafening are not merely due to differences in singing amount over the two days preceding brain sampling.
Without such data, the following statement cannot be justified: "Regarding CRHBP expression in the song motor pathway increases during song acquisition and decreases following deafening."
(3) In Figure 5B, the authors should clearly distinguish between intact and deafened birds and show the singing amount for each group. In practice, deafening often leads to a reduction in both the number of song bouts and the total singing time. If, in this experiment, deafened birds also exhibited reduced singing compared to intact birds, then the decreased CRHBP expression observed in HVC and RA (Figures 3 and 4) may not reflect song deterioration, but rather a simple reduction in singing activity.
As a similar viewpoint, the authors report that CRHBP expression levels in RA and HVC increase with age during the song learning period. However, this change may not be directly related to age or the decline in vocal plasticity. Instead, it could correlate with the singing amount during the one to two days preceding brain sampling. The authors should provide data on the singing activity of the birds used for in situ hybridization during the two days prior to sampling.
Reviewer #2 (Public review):
Summary:
The results presented here are a useful extension of two of their previous papers (Colquitt et al 2021, Colquitt et al 2023), where they used single-cell transcriptomics to characterize the inhibitory and excitatory cell types and gene expression patterns of the song circuit, comparing them to mammalian and reptilian brains, and characterized the effect of deafening on these gene expression patterns. In this paper, they focus on the differential expression of various neuropeptidergic systems in the songbird brain. They discover a role for the CRHBP gene in song performance and causally show its influence on song variability.
Strengths:
The authors leverage the advantages of the 'nucleated' structure of the songbird neural circuitry and use a robust approach to compare neuropeptidergic gene expression patterns in these circuits. Their analysis of the expression patterns of the CRHBP gene in different cell types supports their conclusion that interneurons are particularly amenable to this modulation. Their use of a knockdown strategy along with pharmacological manipulation provides strong support for a causal role of neuropeptidergic modulation on song behaviour. These results have important implications as they bring into focus neuropeptide modulation of the song-motor circuit and pave the way for future studies focussing on how this signalling pathway regulates plasticity during song learning and maintenance.
Weaknesses:
While the results demonstrating the bidirectional modulation of CRH and CRHBP on song performance shed light on their role in song plasticity, it would be important to show this in juvenile finches during sensorimotor learning. We also don't get a clear picture of the 'causal' role of this signalling pathway on the song pre-motor area, HVC, as the knockdown and pharmacological manipulation studies were done in RA, whereas we see a modulation of CRHBP expression during deafening and song learning in both RA and HVC. Given the role of interneurons in the HVC in song acquisition (e.g., Vallentin et al. 2016, Science), it would have been interesting to see the results of HVC-specific manipulation of this neuropeptidergic pathway and/or how it affects the song learning process. Perhaps a short discussion of this would help to give the readers some perspective. Finally, a more direct demonstration of the neurophysiological effect of the signalling pathway would also strengthen our understanding of precisely how these modulate the song circuit plasticity, which I understand might be beyond the scope of this study.
Technical/minor:
In the Methods section, several clarifications would be beneficial. For instance, the description of the design matrices would benefit from being presented in a more general statistical form (e.g., linear model equations) rather than using R syntax. This would make the modeling approach more accessible to readers unfamiliar with software-specific syntax. In addition, while some variables (e.g., cdr_scale, frac_mito_scale) are briefly defined, others (e.g., tags, cut3, nsongs_last_two_days_cut3) could be more clearly described. This applies to the descriptions of both the gene set enrichment analysis and the neuropeptide-receptor analysis, which rely heavily on package-specific terminology (e.g., fgseaMultilevel, computeCommunProb), making it difficult for readers to understand the conceptual or statistical basis of the analyses. Providing a complete list of variable definitions, types (categorical or continuous), and any scaling/transformations applied would enhance clarity and reproducibility.
Reviewer #3 (Public review):
Summary:
The stable production of learned vocalizations like human language and birdsong requires auditory feedback. What happens in the brain areas that generate stable vocalizations as performance deteriorates is not well understood. Using a species of songbird, the current study investigates individual cells within the evolutionarily-conserved brain regions that generate learned vocalizations and reports that the complement of neuropeptide (short protein) signals may be a key feature of behavioral change. Because neuropeptides are important across species, these findings may help explain diminishing stability in learned behaviors even in humans.
Strengths:
The experiments are solid and follow a strong progression from description through manipulation. The songbird model is appropriate and powerful to inform on generalizable biological mechanisms of precisely learned behaviors, including human speech.
Weaknesses:
While it is always possible to perform more experiments, most of the weaknesses are in the presentation of the project, not in the evidence or analysis, which are leading-edge and appropriate. Generally, the ability to follow the findings and to independently assess rigor would be enhanced with increased explicit mention of the statistical thresholds and subjective descriptions. In addition, two prior pieces of relevant work seem to be omitted, including one performing deafening, gene expression measures, and behavioral assessment in zebra finches, and another describing neuropeptide complements in zebra finch singing nuclei based largely on mass spectrometry. The former in particular should be related to the current findings.
Author response:
We thank the reviewers for their time and their constructive comments.
Reviewer 1 makes several incisive comments about the single-cell RNA-sequencing dataset used in this version of the manuscript, which was previously published in Colquitt, 2021. The Reviewer correctly notes that this dataset consists primarily of nuclei from zebra finches, with a relatively small proportion of the data coming from Bengalese finches. However, all other data presented here comes from assays and experiments in Bengalese finches. This discrepancy could lead to two issues of interpretation. First, there could be substantive expression differences in the CRH signaling pathway between these two species, making it difficult to interpret its cellular expression profile. Second, the Reviewer describes that in their reanalysis of this dataset they determined that what had been described as distinct cell types – namely HVC-Glut-1 vs. HVC-Glut-4 (corresponding to the HVC RA projection neurons) and the three RA-Glut types – are likely to be single cell types. The Reviewer notes that inter-individual differences in gene expression, which were not analyzed in the original publication, could have generated this apparent cell type diversity.
To the first point, we agree that the use of the published dataset that consists primarily of zebra finch data is not ideal when making claims of cell type-specific expression in Bengalese finches. To rectify this issue, we have generated additional sets of snRNA-seq from Bengalese finches that encompass multiple areas of the song system as well as adjacent comparator regions outside of the principal song areas. Our initial analysis of these datasets indicates that the cellular patterns of expression of the CRH system are consistent with what has been presented here. In our revision, we will include a reanalysis of neuropeptide expression using these more extensive datasets.
To the second point, we also agree that some of the instances of glutamatergic neuron diversity could have been generated either by issues stemming from the integration of two species or through interindividual differences. In our analysis of our newer snRNA-seq data, we also identify a single HVC RA projection neuron type (not two) and find that RA projection neuron types fall into one or two classes (not three), similar to what Reviewer 1 described. We have deconvolved these datasets by genotype, as suggested by the Reviewer, and do not see substantial interindividual variation across the CRH system. However, our revision will explicitly address these issues.
Reviewer 1 also brings up several important questions concerning the relationships between CRHBP and singing and the challenge of interpreting the influences of song acquisition and deafening on CRHBP expression, given the variation in singing that generally accompanies these changes to song. To address this issue in part, our regression analysis of deafening-associated gene expression differences includes a term for the number of songs sung on the day of euthanasia as well as an interaction term between song destabilization and singing amount. This design controls for the amount that a bird sang in the period before brain collection. This analysis was included in (Colquitt et al., 2023), and will be further elaborated and discussed in the revised version of this manuscript. Notably, CRHBP expression shows a significant interaction between song destabilization and singing amount, suggesting that reduction of CRHBP following deafening is greater than what would be expected from any reductions in singing alone. This specific analysis will be included in the revised manuscript as well.
However, despite these statistical controls, we cannot fully rule out that singing is playing a fundamental role in driving the CRHBP expression differences we see across conditions. Indeed, a number of studies have described an association between the amount a bird sings and the variability of its song (Chen et al., 2013; Hayase et al., 2018; Hilliard et al., 2012; Miller et al., 2010; Ohgushi et al., 2015), with a general trend of higher amounts of singing correlated with a reduction in variability. This relationship is consistent with what we see for CRHBP expression in RA and HVC: high in unmanipulated adult males and decreased during states of high variability and plasticity (post-deafening and juveniles). A model that combines these observations, and that we will include in the Discussion of the revised manuscript, is one in which singing induces the expression of CRHBP in RA and HVC, limiting CRH binding to its receptors, thereby limiting this pathway’s proposed effects on the excitability and synaptic plasticity of projection neurons.
Reviewer 2 suggests multiple interesting avenues to more fully characterize the role of the CRH pathway in song performance and learning. First, we agree that HVC is a compelling target to investigate CRH’s role in song, given the similarity of CRHBP expression in HVC and RA across deafening, song acquisition, and singing. As the Reviewer notes, a number of studies have demonstrated key roles for interneurons in shaping neuronal dynamics in HVC and regulating song structure. Here, we focused on RA due to the direct influence that RA projection neurons have on syringeal and respiration motoneurons controlling song production, and the following expectation that manipulations of CRH signaling in this region would have particularly measurable effects on song. However, we agree with the reviewer that it would be of additional interest to investigate manipulations of CRH signalling in HVC. We are considering whether it will be feasible given the usual constraints of time, personnel, and other competing demands to carry out such experiments as an addition to the current manuscript. Depending on how that goes, we will either add new experimental data to the manuscript, or simply acknowledge the interest of such experiments in Discussion and defer their pursuit to future study.
Likewise, Reviewer 2 suggests other ways in which an understanding of the role of CRH signalling could be further enriched with additional experiments, including investigating the influence of CRH signaling on song acquisition, when song transitions from a variable and plastic state to a precise and stereotyped state, and pursuing direct evidence that CRH influences the neurophysiology of glutamatergic neurons in HVC or RA. These are both excellent suggestions for ways in which neuropeptide signalling could be further linked to alterations in behavior. As we proceed with revisions we will consider whether we can address some of these suggestions within the scope of the current manuscript, versus note them in discussion as directions for future research.
Chen Q, Heston JB, Burkett ZD, White SA. 2013. Expression analysis of the speech-related genes FoxP1 and FoxP2 and their relation to singing behavior in two songbird species. J Exp Biol 216:3682–3692. doi:10.1242/jeb.085886
Colquitt BM, Li K, Green F, Veline R, Brainard MS. 2023. Neural circuit-wide analysis of changes to gene expression during deafening-induced birdsong destabilization. Elife 12:e85970. doi:10.7554/eLife.85970
Hayase S, Wang H, Ohgushi E, Kobayashi M, Mori C, Horita H, Mineta K, Liu W-C, Wada K. 2018. Vocal practice regulates singing activity-dependent genes underlying age-independent vocal learning in songbirds. PLoS Biol 16:e2006537. doi:10.1371/journal.pbio.2006537
Hilliard AT, Miller JE, Fraley ER, Horvath S, White SA. 2012. Molecular microcircuitry underlies functional specification in a basal ganglia circuit dedicated to vocal learning. Neuron 73:537–552. doi:10.1016/j.neuron.2012.01.005
Miller JE, Hilliard AT, White SA. 2010. Song practice promotes acute vocal variability at a key stage of sensorimotor learning. PLoS One 5:e8592. doi:10.1371/journal.pone.0008592
Ohgushi E, Mori C, Wada K. 2015. Diurnal oscillation of vocal development associated with clustered singing by juvenile songbirds. J Exp Biol 218:2260–2268. doi:10.1242/jeb.115105
eLife Assessment
The authors aim to understand why Kupffer cells (KCs) die in metabolic-associated steatotic liver disease (MASLD). This is a useful study using in vitro studies and an in vivo genetic mouse model, suggesting that increased glycolysis contributes to KC death in MASLD. However, the data presented are incomplete as some inconsistencies in the results presented are identified in the characterisation of KCs. This work will be of interest to researchers in the immunology and metabolism fields.
Reviewer #1 (Public review):
Summary:
The authors aim to investigate the mechanisms underlying Kupffer cell death in metabolic-associated steatotic liver disease (MASLD). The authors propose that KCs undergo massive cell death in MASLD and that glycolysis drives this process. However, there appears to be a discrepancy between the reported high rates of KC death and the apparent maintenance of KC homeostasis and replacement capacity.
Strengths:
This is an in vivo study.
Weaknesses:
There are discrepancies between the authors' observations and previous reports, as well as inconsistencies among their own findings.
Before presenting the percentage of CLEC4F⁺TUNEL⁺ cells, the authors should have first shown the number of CLEC4F⁺ cells per unit area in Figure 1. At 16 weeks of age, the proportion of TUNEL⁺ KCs is extremely high (~60%), yet the flow cytometry data indicate that nearly all F4/80⁺ KCs are TIMD4⁺, suggesting an embryonic origin. If such extensive KC death occurred, the proportion of embryonically derived TIMD4⁺ KCs would be expected to decrease substantially. Surprisingly, the proportion of TIMD4⁺ KCs is comparable between chow-fed and 16-week HFHC-fed animals. Thus, the immunostaining and flow cytometry data are inconsistent, making it difficult to explain how massive KC death does not lead to their replacement by monocyte-derived cells.
These data suggest that despite the reported high rate of cell death among CLEC4F⁺TIMD4⁺ KCs, the population appears to self-maintain, with no evidence of monocyte-derived KC generation in this model, which contradicts several recent studies in the field.
Moreover, there is no evidence that TIMD4⁺CLEC4F⁺ KCs increase their proliferation rate to compensate for such extensive cell death. If approximately 60% of KCs are dying and no monocyte-derived KCs are recruited, one would expect a much greater decrease in total KC numbers than what is reported.
It is also unexpected that the maximal rate of KC death occurs at early time points (8 weeks), when the mice have not yet gained substantial weight (Figure 1B). Previous studies have shown that longer feeding periods are typically required to observe the loss of embryo-derived KCs.
Furthermore, it is surprising that the HFD induces as much KC death as the HFHC and MCD diets. Earlier studies suggested that HFD alone is far less effective than MASH-inducing diets at promoting the replacement of embryonic KCs by monocyte-derived macrophages.
In Figure 2D, TIMD4 staining appears extremely faint, making the results difficult to interpret. In contrast, the TUNEL signal is strikingly intense and encompasses a large proportion of liver cells (approximately 60% of KCs, 15% of hepatocytes, 20% of hepatic stellate cells, 30% of non-KC macrophages, and a proportion of endothelial cells is also likely affected). This pattern closely resembles that typically observed in mouse models of acute liver failure. Given this apparent extent of cell death, it is unexpected that ALT and AST levels remain low in MASH mice, which is highly unusual.
No statistical analysis is provided for Figure 5D, and it is unclear which metabolites show statistically significant changes in Figure 5C.
In addition, there is no evaluation of liver pathology in Clec4f-Cre × Chil1flox/flox mice. It remains possible that the observed effects on KC death result from aggravated liver injury in these animals. There is also no evidence that Chil1 deficiency affects glucose metabolism in KCs in vivo.
Finally, the authors should include a more direct experimental approach to modulate glycolysis in KCs and assess its causal role in KC death in MASH.
Reviewer #2 (Public review):
Summary:
In this manuscript, He et al. set out to investigate the mechanisms behind Kupffer Cell death in MASLD. As has been previously shown, they demonstrate a loss of resident KCs in MASLD in different mouse models. They then go on to show that this correlates with alterations in genes/metabolites associated with glucose metabolism in KCs. To investigate the role of glucose metabolism further, they subject isolated KCs in vitro to different metabolic treatments and assess cleaved caspase 3 staining, demonstrating that KCs show increased Cl. Casp 3 staining upon stimulation of glycolysis. Finally, they use a genetic mouse model (Chil1KO) where they have previously reported that loss of this gene leads to increased glycolysis and validate this finding in BMDMs (KO). They then remove this gene specifically from KCs (Clec4fCre) and show that this leads to increased macrophage death compared with controls.
Strengths:
As we do not yet understand why KCs die in MASLD, this manuscript provides some explanation for this finding. The metabolomics is novel and provides insight into KC biology. It could also lead to further investigation; here, it will be important that the full dataset is made available.
Weaknesses:
Different diets are known to induce different amounts of KC loss, yet here, all models examined appear to result in 60% KC death. One small field of view of liver tissue is shown as representative to support these claims, but this is not sufficient, as almost anything can be claimed based on one field of view. Rather, a full tissue slice should be included to allow readers to properly assess the level of death. Additionally, there is no consistency between the markers used to define KCs and moMFs, with CLEC4F being used in microscopy and TIM4 in flow cytometry, while the authors themselves acknowledge that moKCs are CLEC4F+TIM4-. As moKCs are induced in MASLD, this limits interpretation. Additionally, IBA1 is referred to as a moMF marker but is also expressed by KCs, which again prevents an accurate interpretation of the data. Indeed, the authors show that 60% of KCs are dying but only 30% of IBA1+ moMFs. As KCs are also IBA1+, this would mean that KCs die much more than moMFs, which would then limit the relevance of the BMDM studies performed if the phenotype is KC-specific. Therefore, this needs to be clarified. The claim that periportal KCs die preferentially is not supported, given that the majority of KCs are periportal. Rather, these results would need to be normalised to KC numbers in PP vs PC regions to make meaningful conclusions. Additionally, KCs are known to be notoriously difficult to keep alive in vitro, and for these studies, the authors only examine Cl. Casp 3 staining. To fully understand these data, a full analysis of the viability of the cells and whether they retain the KC phenotype in all conditions is required. Finally, in the Cre-driven KO model, there does not seem to be any death of KCs in the controls (rather, numbers trend towards an increase with time on diet, Figure 6E), contrary to what is claimed in the rest of the paper, again making it difficult to interpret the overall results.
Additionally, there is no validation that the increased death observed in vivo in KCs is due to further promotion of glycolysis.
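To illustrate the normalisation requested above for the periportal claim, a minimal sketch follows. The counts are hypothetical and chosen only for illustration (they are not taken from the manuscript), and the function name `zone_death_rates` is likewise an assumption. The sketch shows that a zone can contain most of the dying KCs without KCs in that zone dying at a higher rate.

```python
def zone_death_rates(total_kcs, dying_kcs):
    """Return (rate, share) per zone.

    rate  = dying KCs in a zone / all KCs in that zone (the normalised value)
    share = dying KCs in a zone / all dying KCs (the un-normalised value)
    """
    all_dying = sum(dying_kcs.values())
    rates = {zone: dying_kcs[zone] / total_kcs[zone] for zone in total_kcs}
    shares = {zone: dying_kcs[zone] / all_dying for zone in total_kcs}
    return rates, shares


# Hypothetical counts per zone (illustration only, not data from the study).
total = {"PP": 700, "PC": 300}   # all KCs counted in each zone
dying = {"PP": 420, "PC": 180}   # TUNEL-positive KCs in each zone

rates, shares = zone_death_rates(total, dying)
for zone in total:
    print(zone, "share of dying KCs:", round(shares[zone], 2),
          "death rate within zone:", round(rates[zone], 2))
```

In this toy case 70% of dying KCs are periportal, yet the within-zone death rate is identical in both zones, so no preferential periportal death could be concluded.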
Reviewer #3 (Public review):
This manuscript provides novel insights into altered glucose metabolism and KC status during early MASLD. The authors propose that hyperactivated glycolysis drives a spatially patterned KC depletion that is more pronounced than the loss of hepatocytes or hepatic stellate cells. This concept significantly enhances our understanding of early MASLD progression and KC metabolic phenotype.
Through a combination of TUNEL staining and MS-based metabolomic analyses of KCs from HFHC-fed mice, the authors show increased KC apoptosis alongside dysregulation of glycolysis and the pentose phosphate pathway. Using in vitro culture systems and KC-specific ablation of Chil1, a regulator of glycolytic flux, they further show that elevated glycolysis can promote KC apoptosis.
However, it remains unclear whether the observed metabolic dysregulation directly causes KC death or whether secondary factors, such as low-grade inflammation or macrophage activation, also contribute significantly. Nonetheless, the results, particularly those derived from the Chil1-ablated model, point to a new potential target for the early prevention of KC death during MASLD progression.
The manuscript is clearly written and thoughtfully addresses key limitations in the field, especially the focus on glycolytic intermediates rather than fatty acid oxidation. The authors acknowledge the missing mechanistic link between increased glycolysis and KC death. Still, several interpretations require moderation to avoid overstatement, and certain experimental details, particularly those concerning flow cytometry and population gating, need further clarification.
Strengths:
(1) The study presents the novel observation of profound metabolic dysregulation in KCs during early MASLD and identifies these cells as undergoing apoptosis. The finding that Chil1 ablation aggravates this phenotype opens new avenues for exploring therapeutic strategies to mitigate or reverse MASLD progression.
(2) The authors provide a comprehensive metabolic profile of KCs following HFHC diet exposure, including quantification of individual metabolites. They further delineate alterations in glycolysis and the pentose phosphate pathway in Chil1-deficient cells, substantiating enhanced glycolytic flux through 13C-glucose tracing experiments.
(3) The data underscore the critical importance of maintaining balanced glucose metabolism in both in vitro and in vivo contexts to prevent KC apoptosis, emphasizing the high metabolic specialization of these cells.
(4) The observed increase in KC death in Chil1-deficient KCs demonstrates their dependence on tightly regulated glycolysis, particularly under pathological conditions such as early MASLD.
Weaknesses:
(1) The novelty is questionable. The presented work has considerable overlap with a study by the same lab, which is currently under review (citation 17), and the authors should consider whether the data would be better presented in a single paper.
(2) The authors report that 60% of KCs are TUNEL-positive after 16 weeks of HFHC diet and confirm this by cleaved caspase-3 staining. Given that such marker positivity typically indicates imminent cell death within hours, it is unexpected that more extensive KC depletion or monocyte infiltration is not observed. Since Timd4 expression on monocyte-derived macrophages takes roughly one month to establish, the authors should consider whether these TUNEL-positive KCs persist in a pre-apoptotic state longer than anticipated. Alternatively, fate-mapping experiments could clarify the dynamics of KC death and replacement.
(3) The mechanistic link between elevated glycolytic flux and KC death remains unclear.
(4) The study does not address the polarization or ontogeny of KCs during early MASLD. Given that pro-inflammatory macrophages preferentially utilize glycolysis, such data could provide valuable insight into the reason for increased KC death beyond the presented hyperreliance on glycolysis.
(5) The gating strategy for monocyte-derived macrophages (moMFs) appears suboptimal and may include monocytes. A more rigorous characterization of myeloid populations by including additional markers would strengthen the study's conclusions.
(6) While BMDMs from Chil1 knockout mice are used to demonstrate enhanced glycolytic flux, it remains unclear whether Chil1 deficiency affects macrophage differentiation itself.
(7) The authors use the PDK activator PS48 and the ATP synthase inhibitor oligomycin to argue that increased glycolytic flux at the expense of OXPHOS promotes KC death. However, given the high energy demands of KCs and the fact that OXPHOS yields 15-16 times more ATP per glucose molecule than glycolysis, the increased apoptosis observed in Figure 4C-F could primarily reflect energy deprivation rather than a glycolysis-specific mechanism.
(8) In Figure 1C, KC numbers are significantly reduced after 4 and 16 weeks of HFHC diet in WT male mice, yet no comparable reduction is seen in Clec4Cre control mice, which should theoretically exhibit similar behavior under identical conditions.
eLife Assessment
This study examines the role of the fungal pathogen Candida albicans in the progression of colorectal cancer, a relevant and urgent topic given the global incidence of colon cancer. While the findings are useful and provide solid experimental work and insight into how Candida may contribute to tumor progression, the small patient sample size, reliance on in vitro models, and absence of in vivo validation may limit its impact. This work will interest scientists studying cancer progression and the role played by pathogens.
Reviewer #1 (Public review):
Summary:
This study addresses the emerging role of fungal pathogens in colorectal cancer and provides mechanistic insights into how Candida albicans may influence tumor-promoting pathways. While the work is potentially impactful and the experiments are carefully executed, the strength of evidence is limited by reliance on in vitro models, small patient sample size, and the absence of in vivo validation, which reduces the translational significance of the findings.
Strengths:
(1) Comprehensive mechanistic dissection of intracellular signaling pathways.
(2) Broad use of pharmacological inhibitors and cell line models.
(3) Inclusion of patient-derived organoids, which increases relevance to human disease.
(4) Focus on an emerging and underexplored aspect of the tumor microenvironment, namely fungal pathogens.
Weaknesses:
(1) Clinical association data are inconsistent and based on very small sample numbers.
(2) No in vivo validation, which limits the translational significance.
(3) Species- and cell type-specificity claims are not well supported by the presented controls.
(4) Reliance on colorectal cancer cell lines alone makes it difficult to judge whether findings are specific or general epithelial responses.
Reviewer #2 (Public review):
In this manuscript, the authors studied the role of Candida albicans in colorectal cancer progression. They have undertaken a thorough investigation using several methods. The topic is highly relevant, given the increasing burden of colon cancer globally and the urgent need for innovative treatment options.
However, there are some inconsistencies in the figures and some missing details in the figures, including:
(1) The authors should clearly explain in the results section which patient samples are shown in Figure 1B.
(2) What do the letters a, ab, b, b written above the bars in Figure 1F represent? The authors should consider removing them, because they create confusion. Also, there is no explanation for those letters in the figure legend.
(3) The authors should submit all the raw Western blot images with appropriate labels to indicate the bands of the proteins of interest along with molecular weight markers.
(4) The authors should do the quantification of data in Figure 2d and include it in the figure.
(5) In Figure 2h, the authors should indicate if the quantification represents VEGF expression after 6h or 12h of C. albicans co-culture with cells.
(6) In Figure 2i, quantification of VEGF should be done and data from three independent experiments should be submitted. The authors should also mention the time point.
eLife Assessment
This is a valuable study describing transcriptome-based pheochromocytoma and paraganglioma (PPGL) subtypes and exploring the mutations, immune correlates and disease progression of cases in each subtype. The cohort is a reasonable size and a second cohort is included from the Cancer Genome Atlas (TCGA). One of the key premises of the study is that identification of driver mutations in PPGL is not complete and that compromises characterisation for prognostic purposes. This is a solid starting point on which to base characterisation using different methods.
Reviewer #1 (Public review):
This study presents an exploration of PPGL tumour bulk transcriptomics and identifies three clusters of samples (labeled as subtypes C1-C3). Each subtype is then investigated for the presence of somatic mutations, metabolism-associated pathway and inflammation correlates, and disease progression.
The proposed subtype descriptions are presented as an exploratory study. The proposed potential biomarkers from this subtype are suitably caveated and will require further validation in PPGL cohorts together with mechanistic study.
The first section uses WGCNA (a method to identify clusters of samples based on gene expression correlations) to discover three transcriptome-based clusters of PPGL tumours using a new cohort of n=87 PPGL samples from various locations in the body.
The second section inspects a previously published snRNAseq dataset, assigning the published samples to subtypes C1-C3 using a pseudo-bulk approach.
The tumour samples are obtained from multiple locations in the body, summarised in Figure 1A. It will be important to see further investigation of how the sample origin is distributed among the C1-C3 clusters, and whether there is a sample-origin association with mutational drivers and disease progression.
Comments on revisions:
In SupplFile3 (pdf) - please correct the table format. The contents are obscured due to the narrowness of the table columns.
Deposit the new RNAseq data (N=87 cases, N=5 controls) in an appropriate repository; see "Data on human genotypes and phenotypes" at https://elife-rp.msubmit.net/html/elife-rp_author_instructions.html#dataavailability
Reviewer #2 (Public review):
Summary:
A study that furthers the molecular definition of PPGL (where prognosis is variable) and provides a wide range of sub-experiments to back up the findings. One of the key premises of the study is that identification of driver mutations in PPGL is incomplete and that compromises characterisation for prognostic purposes. This is a reasonable starting point on which to base some characterisation based on different methods.
Strengths:
The cohort is a reasonable size, and a useful validation cohort in the form of TCGA is used. Whilst it would be resource-intensive (though plausible given the rarity of the tumour type) to perform RNAseq on all PPGL samples in clinical practice, some potential proxies are proposed.
Weaknesses:
Performance of some of the proxy markers for transcriptional subtype is not presented.
Limited prognostic information available.
Comments on revisions:
Having reviewed the responses to my comments and associated revisions, I am satisfied that they have been addressed.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public Review):
This study presents an exploration of PPGL tumour bulk transcriptomics and identifies three clusters of samples (labeled as subtypes C1-C3). Each subtype is then investigated for the presence of somatic mutations, metabolism-associated pathways and inflammation correlates, and disease progression. The proposed subtype descriptions are presented as an exploratory study. The proposed potential biomarkers from this subtype are suitably caveated and will require further validation in PPGL cohorts together with a mechanistic study.
The first section uses WGCNA (a method to identify clusters of samples based on gene expression correlations) to discover three transcriptome-based clusters of PPGL tumours. The second section inspects a previously published snRNAseq dataset, and labels some of the published cells as subtypes C1, C2, C3 (Methods could be clarified here), among other cells labelled as immune cell types. Further details about how the previously reported single-nuclei were assigned to the newly described subtypes C1-C3 require clarification.
Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).
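The pseudo-bulk steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the counts-per-10,000 scale factor, the use of a mean z-score as the gene-set score, and the "highest score wins" assignment rule are all assumptions made for the example.

```python
import numpy as np


def pseudobulk_zscores(counts, sample_ids):
    """counts: nuclei x genes raw counts; sample_ids: one sample label per nucleus."""
    counts = np.asarray(counts, dtype=float)
    sample_ids = np.asarray(sample_ids)
    samples = np.unique(sample_ids)
    # 1) Sum raw counts over all nuclei that belong to each sample.
    bulk = np.vstack([counts[sample_ids == s].sum(axis=0) for s in samples])
    # 2) Library-size normalisation (scale factor of 1e4 is an assumption).
    norm = bulk / bulk.sum(axis=1, keepdims=True) * 1e4
    # 3) log1p transform.
    logged = np.log1p(norm)
    # 4) z-scale each gene across samples; constant genes are left at zero.
    sd = logged.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    z = (logged - logged.mean(axis=0)) / sd
    return samples, z


def assign_subtype(z, gene_sets):
    """gene_sets: subtype name -> list of gene column indices (hypothetical sets).

    Each sample is scored by the mean z-score of a gene set and assigned
    to the subtype with the highest score (an assumed decision rule).
    """
    names = list(gene_sets)
    scores = np.column_stack([z[:, gene_sets[n]].mean(axis=1) for n in names])
    return [names[i] for i in scores.argmax(axis=1)], scores


# Toy input: 6 nuclei from 3 samples, 6 genes (illustration only).
counts = [[10, 10, 1, 1, 1, 1], [10, 10, 1, 1, 1, 1],
          [1, 1, 10, 10, 1, 1], [1, 1, 10, 10, 1, 1],
          [1, 1, 1, 1, 10, 10], [1, 1, 1, 1, 10, 10]]
sample_ids = ["A", "A", "B", "B", "C", "C"]
gene_sets = {"C1": [0, 1], "C2": [2, 3], "C3": [4, 5]}

samples, z = pseudobulk_zscores(counts, sample_ids)
labels, scores = assign_subtype(z, gene_sets)
print(dict(zip(samples, labels)))
```

Summing counts before normalising keeps the profile comparable to bulk RNA-seq, which is what allows gene sets derived from bulk data to be applied.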
The tumour samples are obtained from multiple locations in the body (Figure 1A). It will be important to see further investigation of how the sample origin is distributed among the C1-C3 clusters, and whether there is a sample-origin association with mutational drivers and disease progression.
Thank you for your valuable suggestion. In the revised manuscript (lines 74-79), Figure 1A, Table S1 and Supplementary Figure 1A, we harmonized anatomic site annotations from our PPGL cohort and the TCGA cohort and analyzed the distribution of tumor origin (adrenal vs extra-adrenal) across subtypes. The site composition is essentially uniform across C1-C3, at approximately 75% pheochromocytoma (PC) and 25% paraganglioma (PG), with only minimal variation. Notably, the proportion of extra-adrenal (paraganglioma) origin is slightly higher in the C1 subtype (see Supplementary Figure 1A), consistent with the more aggressive behavior typically exhibited by tumors from this anatomical site.
Reviewer #2 (Public Review):
A study that furthers the molecular definition of PPGL (where prognosis is variable) and provides a wide range of sub-experiments to back up the findings. One of the key premises of the study is that identification of driver mutations in PPGL is incomplete and that compromises characterisation for prognostic purposes. This is a reasonable starting point on which to base some characterisation based on different methods. The cohort is a reasonable size, and a useful validation cohort in the form of TCGA is used. Whilst it would be resource-intensive (though plausible given the rarity of the tumour type) to perform RNA-seq on all PPGL samples in clinical practice, some potential proxies are proposed.
We sincerely thank the reviewer for their positive assessment of our study’s rationale. We fully agree that RNA sequencing for all PPGL samples remains resource-intensive in current clinical practice, and its widespread application still faces feasibility challenges. It is precisely for this reason that, after defining transcriptional subtypes, we further focused on identifying and validating practical molecular markers and exploring their detectability at the protein level.
In this study, we validated key markers such as ANGPT2, PCSK1N, and GPX3 using immunohistochemistry (IHC), demonstrating their ability to effectively distinguish among molecular subtypes (see Figure 5). This provides a potential tool for the clinical translation of transcriptional subtyping, similar to the transcription factor-based subtyping in small cell lung cancer where IHC enables low-cost and rapid molecular classification.
It should be noted that the subtyping performance of these markers has so far been preliminarily validated only in our internal cohort of 87 PPGL samples. We agree with the reviewer that larger-scale, multi-center prospective studies are needed in the future to further establish the reliability and prognostic value of these markers in clinical practice.
The performance of some of the proxy markers for transcriptional subtype is not presented.
We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping. In our study, we have in fact taken this point into full consideration. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.
Ultimately, we identified ANGPT2, PCSK1N, and GPX3 as robust marker genes for subtypes C1, C2, and C3, respectively; each is significantly overexpressed in its subtype and exhibits the most pronounced β value (Figure 5A and Supplementary Figure 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.
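As an illustration of ranking candidate markers by regression effect size, a minimal sketch follows. The exact model specification used in the study (covariates, software, significance testing) is not described here, so the one-vs-rest indicator design, the function names, and the toy expression values are all assumptions.

```python
import numpy as np


def one_vs_rest_beta(expression, labels, subtype):
    """OLS slope of expression on a 0/1 indicator for `subtype`.

    With a single binary predictor, the slope equals the difference between
    the mean expression inside the subtype and the mean outside it.
    """
    y = np.asarray(expression, dtype=float)
    x = (np.asarray(labels) == subtype).astype(float)
    design = np.column_stack([np.ones_like(x), x])  # intercept + indicator
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[1])


def rank_markers(expression_by_gene, labels, subtype):
    """Return gene names sorted by decreasing beta for the given subtype."""
    betas = {gene: one_vs_rest_beta(values, labels, subtype)
             for gene, values in expression_by_gene.items()}
    return sorted(betas, key=betas.get, reverse=True), betas


# Toy data (illustration only): two hypothetical genes, six samples.
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
expression_by_gene = {
    "geneA": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # higher in C1
    "geneB": [2.0, 3.0, 2.5, 2.0, 3.0, 2.5],  # no difference
}
ranking, betas = rank_markers(expression_by_gene, labels, "C1")
print(ranking, betas)
```

A full analysis would also report a standard error and p-value for each β, which this sketch omits.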
There is limited prognostic information available.
Thank you for your valuable suggestion. In this exploratory revision, we present the available prognostic signal in Figure 5C. Given the current event numbers and follow-up time, we intentionally limited inference. We are continuing longitudinal follow-up of the PPGL cohort and will periodically update and report mature time-to-event analyses in subsequent work.
Reviewer #1 (Recommendations for the authors):
There is no deposition reference for the RNAseq transcriptomics data. Have the data been deposited in a suitable data repository?
Thank you for your valuable suggestion. We have updated the Data availability section (lines 508–511) to clarify that the bulk-tissue RNA-seq datasets generated in this study are available from the corresponding author upon reasonable request.
In the snRNAseq analysis of existing published data, clarify how cells were labelled as "C1", "C2", "C3", alongside cells labelled by cell type (the latter is described briefly in the Methods).
Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).
Package versions should be included (e.g., CellChat, monocle2).
We greatly appreciate your comments and have now added a dedicated “Software and versions” subsection in Methods. Specifically, we report Seurat (v4.4.0), sctransform (v0.4.2), CellChat (v2.2.0), monocle (v2.36.0; monocle2), pheatmap (v1.0.13), clusterProfiler (v4.16.0), survival (v3.8.3), and ggplot2 (v3.5.2) (lines 514-516). We also corrected a typographical error (“mafools” → “maftools”) (line 463).
Reviewer #2 (Recommendations for the authors):
It would be helpful to provide a little more detail on the clinical composition of the cohort (e.g., phaeo vs paraganglioma, age, etc.) in the text, acknowledging that this is done in Figure 1.
Thank you for your valuable suggestion. In the revision, we added Table S1 that provides a detailed summary of the clinical composition of the PPGL cohort. Specifically, we report the numbers and proportions (Supplementary Figure 1A) of pheochromocytoma (PC) versus paraganglioma (PG), further subclassifying PG into head and neck (HN-PG), retroperitoneal (RPPG), and bladder (BC-PG).
How many of each transcriptional subtype had driver mutations (germline or somatic)? This is included in the figures but would be worth mentioning in the text. Presumably, some of these may be present but not detected (e.g., non-coding variants), and this should be commented on. It is feasible that if methods to detect all the relevant genomic markers were improved, then the rate of tumours without driver mutations would be less and their prognostic utility would be more comprehensive.
Thank you for your valuable suggestion. In the revision (lines 113–116), we now report the prevalence of driver mutations (germline or somatic) overall and by transcriptional subtype. We analyzed variant data across 84 PPGL-relevant genes from 179 tumors in the TCGA cohort and 30 tumors in Magnus’s cohort (Fig. 2A; Table S2). High-frequency genes were consistent with known biology—C1 enriched for [e.g., VHL/SDHB], C2 for [e.g., RET/HRAS], and C3 for [e.g., SDHA/SDHD]. We also note that a subset of tumors lacked an identifiable driver, which likely reflects current assay limitations (e.g., non-coding or structural variants, subclonality, and purity effects). Broader genomic profiling (deep WGS/long-read, RNA fusion, methylation) would be expected to reduce the “driver-negative” fraction and further enhance the prognostic utility of these classifiers.
ANGPT2 provides a reasonable predictive capacity for the C1 subtype as defined by the ROC AUC. What was the performance of the PCSK1N and GPX3 as markers of the other subtypes?
We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping, and we have supplemented the analysis with ROC and AUC values for two additional parameters (Author response image 1, see below). Furthermore, in our study, we have in fact taken this point into full consideration. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.
Ultimately, we identified ANGPT2, PCSK1N, and GPX3 as robust marker genes for subtypes C1, C2, and C3, respectively; each is significantly overexpressed in its subtype and exhibits the most pronounced β value (Figure 5A and Supplementary Figure 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.
Author response image 1.
Extended Data Figure A-B. (A) ROC curve showing the ability of PCSK1N expression to distinguish subtype C2 from non-C2 subtypes in PPGLs. The red dot indicates the point with the highest sensitivity (93.1%) and specificity (82.8%). (B) ROC curve showing the ability of GPX3 expression to distinguish subtype C3 from non-C3 subtypes in PPGLs. The red dot indicates the point with the highest sensitivity (83.0%) and specificity (58.8%). AUC, area under the curve.
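For readers who wish to reproduce this kind of analysis, a minimal sketch of a one-vs-rest ROC computation follows. It is an illustration only: the function names and toy scores are hypothetical, and selecting the operating point by Youden's J (sensitivity + specificity - 1) is an assumption about how a "highest sensitivity and specificity" point might be chosen, not a statement of the method used for the image above.

```python
import numpy as np


def roc_points(scores, is_positive):
    """Return (fpr, tpr, thresholds); a sample is called positive if score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(is_positive, dtype=bool)
    thresholds = np.unique(scores)[::-1]  # descending, ties collapsed
    tpr = np.array([(scores[y] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[~y] >= t).mean() for t in thresholds])
    # Prepend the (0, 0) corner so the curve spans the full unit square.
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr]), thresholds


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def best_youden(fpr, tpr, thresholds):
    """Threshold maximising J = sensitivity + specificity - 1."""
    j = tpr[1:] - fpr[1:]
    i = int(np.argmax(j))
    return float(thresholds[i]), float(tpr[i + 1]), float(1.0 - fpr[i + 1])


# Toy marker expression (hypothetical): True marks samples of the target subtype.
scores = [0.1, 0.4, 0.35, 0.8]
in_subtype = [False, False, True, True]

fpr, tpr, thr = roc_points(scores, in_subtype)
print("AUC:", auc_trapezoid(fpr, tpr))
print("threshold, sensitivity, specificity:", best_youden(fpr, tpr, thr))
```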
In the discussion, I think it would be valuable to summarise existing clinical/molecular predictors in PPGL and, acknowledging that their performance may be limited, compare them to the potential of these novel classifiers.
Thank you for your valuable suggestion. We have added a concise overview of established clinical and molecular predictors in PPGL and compared them with the potential of our transcriptional classifiers. The new paragraph (Discussion, lines 315–338) now reads:
“Compared to existing clinical and molecular predictors, risk assessment in PPGL has long relied on the following indicators: clinicopathological features (e.g., tumor size, non-adrenal origin, specific secretory phenotype, Ki-67 index), histopathological scoring systems (such as PASS/GAPP), and certain genetic alterations (including high-risk markers like SDHB inactivation mutations, as well as susceptibility gene mutations in ATRX, TERT promoter, MAML3, VHL, NF1, among others). Although these metrics are highly actionable in clinical practice, they exhibit several limitations: first, current molecular markers only cover a subset of patients, and technical constraints hinder the detection of many potentially significant variants (e.g., non-coding mutations), thereby compromising the comprehensiveness of prognostic evaluation; second, histopathological scoring is susceptible to interobserver variability; furthermore, the lack of standardized detection and evaluation protocols across institutions limits the comparability and generalizability of results. Our transcriptomic classification system—comprising C1 (pseudohypoxic/angiogenic signature), C2 (kinase-signaling signature), and C3 (SDHx-related signature)—provides a complementary approach to PPGL risk assessment. These subtypes reflect distinct biological backgrounds tied to specific genetic alterations and can be approximated by measuring the expression of individual genes (e.g., ANGPT2, PCSK1N, or GPX3). This study demonstrates that the classifier offers three major advantages: first, it accurately distinguishes subtypes with coherent biological features; second, it retains significant predictive value even after adjusting for clinical covariates; third, it can be implemented using readily available assays such as immunohistochemistry. 
These findings suggest that integrating transcriptomic subtyping with conventional clinical markers may offer a more comprehensive and generalizable risk stratification framework. However, this strategy would require validation through multi-center prospective studies and standardization of detection protocols.”
A little more explanation of the principles behind WGCNA would be useful in the methods.
We are grateful for your comments. We have expanded the Methods to briefly explain the principles of WGCNA (lines 426-454). In short, WGCNA constructs a weighted co-expression network from normalized gene expression, identifies modules of tightly co-expressed genes, summarizes each module by its eigengene (the first principal component), and then correlates module eigengenes with phenotypes (e.g., transcriptional subtypes) to highlight biologically meaningful gene sets and candidate hub genes. We now specify our preprocessing, choice of soft-thresholding power to approximate scale-free topology, module detection/merging criteria, and the statistics used for module–trait association and downstream gene-set scoring.
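To make the WGCNA steps summarized above concrete, here is a minimal NumPy sketch of three of them: soft-thresholded adjacency, module eigengene, and module–trait correlation. It is an illustration under stated assumptions, not the authors' pipeline: it does not use the WGCNA R package, the power of 6 is a placeholder, the toy data are simulated, and module detection (hierarchical clustering on topological overlap) is skipped by assuming module membership is already known.

```python
import numpy as np

def soft_threshold_adjacency(expr, power=6):
    """Unsigned weighted adjacency |cor(gene_i, gene_j)| ** power.

    expr: samples x genes array of normalized expression.
    power: soft-thresholding power (6 is a placeholder, not the authors' value).
    """
    corr = np.corrcoef(expr, rowvar=False)   # genes x genes Pearson correlation
    adj = np.abs(corr) ** power              # soft thresholding keeps edge weights continuous
    np.fill_diagonal(adj, 0.0)               # ignore self-connections
    return adj

def module_eigengene(expr, gene_idx):
    """First principal component (one value per sample) of a module's genes."""
    sub = expr[:, gene_idx]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0)   # standardize each gene
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    eigengene = u[:, 0] * s[0]
    # The sign of a principal component is arbitrary; align it with mean expression.
    if np.corrcoef(eigengene, sub.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

# Toy data: 40 samples, 10 genes; genes 0-4 share a latent signal, genes 5-9 are noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=40)
module_genes = latent[:, None] + 0.3 * rng.normal(size=(40, 5))
noise_genes = rng.normal(size=(40, 5))
expr = np.hstack([module_genes, noise_genes])
trait = (latent > 0).astype(float)           # hypothetical binary phenotype

adj = soft_threshold_adjacency(expr)
eigengene = module_eigengene(expr, np.arange(5))
module_trait_r = np.corrcoef(eigengene, trait)[0, 1]
within = adj[:5, :5].sum() / 20              # mean off-diagonal weight inside the module
between = adj[:5, 5:].mean()                 # mean weight between module and noise genes
```

In this toy setting the co-expressed genes end up with much larger edge weights among themselves than with the noise genes, and their eigengene tracks the simulated phenotype. In a real analysis the power would be chosen from a scale-free topology fit rather than fixed in advance.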
On line 234, I think the figure should be 5C?
We greatly appreciate your comment and have corrected this to Figure 5C.
eLife Assessment
This important series of studies provides converging results from complementary neuroimaging and behavioral experiments to identify human brain regions involved in representing regular geometric shapes and their core features. Geometric shape concepts are present across diverse human cultures and possibly involved in human capabilities such as numerical cognition and mathematical reasoning. Identifying the brain networks involved in geometric shape representation is of broad interest to researchers studying human visual perception, reasoning, and cognition. The evidence supporting the presence of representation of geometric shape regularity in dorsal parietal and prefrontal cortex is solid, but does not directly demonstrate that these circuits overlap with those involved in mathematical reasoning. Furthermore, the links to defining features of geometric objects and with mathematical and symbolic reasoning would benefit from stronger evidence from more fine-tuned experimental tasks varying the stimuli and experience.
Reviewer #1 (Public review):
This paper examines how geometric regularities in abstract shapes (e.g., parallelograms, kites) are perceived and processed in the human brain. The manuscript contains multimodal data (behavior, fMRI, MEG) from adults and additional fMRI data from 6-year-old children. The key findings show that (1) processing geometric shapes leads to reduced activity in ventral areas in comparison to complex stimuli and increased activity in intraparietal and inferior temporal regions, (2) the degree of geometric regularity modulates activity in intraparietal and inferior temporal regions, (3) similarity in neural representation of geometric shapes can be captured early by using CNN models and later by models of geometric regularity. In addition to these novel findings, the paper also includes a replication of behavioral data, showing that the perceptual similarity structure amongst the geometric stimuli used can be explained by a combination of visual similarities (as indexed by a feedforward CNN model of the ventral visual pathway) and geometric features. The paper comes with openly accessible code in a well-documented GitHub repository and the data will be published with the paper on OpenNeuro.
In the revised version of this manuscript, the authors clarified certain aspects of the task design, added critical detail to the description of the methods, and updated the figures to show unsmoothed data and variability across participants. Importantly, the authors thoroughly discussed potential task effects (for the fMRI data only) and added additional analyses that indicate that the effects are unlikely to be driven by linguistic labels/name availability of the stimuli.
Comments on the revision:
Thank you for carefully addressing all my concerns and especially for clarifying the task design.
Reviewer #2 (Public review):
Summary
The current study seeks to understand the neural mechanisms underlying geometric reasoning. Using fMRI with both children and adults, the authors found that contrasting simple geometric shapes with naturalistic images (faces, tools, houses) led to responses in the dorsal visual stream, rather than ventral regions that are generally thought to represent shape properties. The authors followed up on this result using computational modeling and MEG to show that geometric properties explain variance in the neural response that is distinct from what is captured by a CNN.
Strengths
These findings contribute much-needed neural and developmental data to the ongoing debate regarding shape processing in the brain and offer additional insights into why CNNs may have difficulty with shape processing. The motivation and discussion for the study are appropriately measured, and I appreciate the authors' use of multiple populations, neuroimaging modalities, and computational models to explore this question.
Weaknesses
The presence of activation in aIPS led the authors to interpret their results to mean that geometric reasoning draws on the same processes as mathematical thinking. However, there is only weak and indirect evidence in the current study that geometric reasoning, as it is tested here, draws on the same circuits as math.
Reviewer #3 (Public review):
Summary:
The authors report converging evidence from behavioral studies as well as several brain-imaging techniques that geometric figures, notably quadrilaterals, are processed differently in visual (lower activation) and spatial (greater) areas of the human brain than representative figures. Comparison of mathematical models to fit activity for geometric figures shows the best fit for abstract geometric features like parallelism and symmetry. The brain areas active for geometric figures are also active in processing mathematical concepts even in blind mathematicians, linking geometric shapes to abstract math concepts. The effects are stronger in adults than in 6-year-old Western children. Similar phenomena do not appear in great apes, suggesting that this is uniquely human and developmental.
Strengths:
Multiple converging techniques of brain imaging and testing of mathematical models showing special status of perception of abstract forms. Careful reasoning at every step of research and presentation of research, anticipating and addressing possible reservations. Connecting these findings to other findings, brain, behavior, and historical/anthropological to suggest broad and important fundamental connections between abstract visual-spatial forms and mathematical reasoning.
Weaknesses:
I have reservations about the authors' use of "symbolic." They seem to interpret "symbolic" as relying on "discrete, exact, rule-based features." Words are generally considered to be symbolic (that is their major function), yet words do not meet those criteria. Depictions of objects can be regarded as symbolic because they represent real objects; they are not the same as the object (as Magritte observed). If so, then perhaps depictions of quadrilaterals are also symbolic, but then they do not differ from depictions of objects on that quality. Relatedly, calling abstract or generalized representations of forms a distinct "language of thought" doesn't seem supportable by the current findings. Minimally, a language has elements that are combined more or less according to rules. The authors present evidence for geometric forms as elements but nowhere is there evidence for combining them into meaningful strings.
Further thoughts
Incidentally, there have been many attempts at constructing visual languages from visual elements combined by rules, that is, mapping meaning to depictions. Many written languages like Egyptian hieroglyphics or Mayan or Chinese, began that way; there are current attempts using emoji. Apparently, mapping sound to discrete letters, alphabets, is more efficient and was invented once but spread. That said, for restricted domains like maps, circuit diagrams, networks, chemical interactions, mathematics, and more, visual "languages" work quite well.
The findings are striking and as such invite speculation about their meaning and limitations. The images of real objects seem to be interpreted as representations of 3D objects as they activate the same visual areas as real objects. By contrast, the images of 2D geometric forms are not interpreted as representations of real objects but rather seemingly as 2D abstractions. It would be instructive to investigate stimuli that are on a continuum from representational to geometric, e. g., real objects that have simple geometric forms like table tops or boxes under various projections or balls or buildings that are rectangular or triangular. Objects differ from geometric forms in many ways: 3D rather than 2D, more complicated shapes; internal features as well as outlines. The geometric figures used are flat, 2-D, but much geometry is 3-D (e. g. cubes) with similar abstract features. The feature space of geometry is more than parallelism and symmetry; angles are important for example. Listing and testing features would be fascinating.
Can we say that mathematical thinking began with the regularities of shapes or with counting, or both? External representations of counting go far back into prehistory; tallies are frequent and wide-spread. Infants are sensitive to number across domains as are other primates (and perhaps other species). Finding overlapping brain areas for geometric forms and number is intriguing but doesn't show how they are related.
Categories are established in part by contrast categories; are quadrilaterals and triangles and circles different categories? As for quadrilaterals, the authors say some are "completely irregular." Not really; they are still quadrilaterals, if atypical. See Eleanor Rosch's insightful work on (visual) categories. One wonders about distinguishing squashed quadrilaterals from squashed triangles.
What in human experience but not the experience of close primates would drive the abstraction of these geometric properties? It's easy to make a case for elaborate brain processes for recognizing and distinguishing things in the world, shared by many species, but the case for brain areas sensitive to abstracting geometric figures is harder. The fact that these areas are active in blind mathematicians and that they are parietal areas suggest that what is important is spatial far more than visual. Could these geometric figures and their abstract properties be connected in some way to behavior, perhaps with fabrication, construction or use of objects? Or with other interactions with complex objects and environments where symmetry and parallelism (and angles and curvature--and weight and size) would be important? Manual dexterity and fabrication also distinguish humans from great apes (quantitatively not qualitatively) and action drives both visual and spatial representations of objects and spaces in the brain. I certainly wouldn't expect the authors to add research to this already packed paper, but raising some of the conceptual issues would contribute to the significance of the paper.
Author response:
The following is the authors’ response to the original reviews
Reviewer #1 (Public review):
Weakness:
I wonder how task difficulty and linguistic labels interact with the current findings. Based on the behavioral data, shapes with more geometric regularities are easier to detect when surrounded by other shapes. Do shape labels that are readily available (e.g., "square") help in making accurate and speedy decisions? Can the sensitivity to geometric regularity in intraparietal and inferior temporal regions be attributed to differences in task difficulty? Similarly, are the MEG oddball detection effects that are modulated by geometric regularity also affected by task difficulty?
We see two aspects to the reviewer’s remarks.
(1) Names for shapes.
On the one hand, there is the question of whether it matters, in our task, that certain shapes have names and others do not. The work presented here is not designed to specifically test the effect of formal western education; however, in previous work (Sablé-Meyer et al., 2021), we noted that the geometric regularity effect remains present even for shapes that do not have specific names, and even in participants who do not have names for them. In that work, we replicated our main effects with both preschoolers and adults who had not received formal western education and found that our geometric feature model remained predictive of their behavior; we refer the reader to this previous paper for an extensive discussion of the possible role of linguistic labels, and the impact of the statistics of the environment on task performance.
What is more, in our behavior experiments we can discard data from any shape that has a name in English and run our model comparison again. Doing so diminished the effect size of the geometric feature model, but it remained predictive of human behavior: indeed, if we remove all shapes but kite, rightKite, rustedHinge, hinge and random (i.e., we discard more than half of our data and keep only shapes for which we came up with names but for which there are no established names), we nevertheless find that both models significantly correlate with human behavior (see the plot in Author response image 1, the equivalent of our Fig. 1E with the remaining shapes).
Author response image 1.
An identical analysis on the MEG data leads to two noisy but significant clusters (CNN: 64.0ms to 172.0ms; then 192.0ms to 296.0ms; both p<.001; Geometric Features: 312.0ms to 364.0ms with p=.008). We have improved our manuscript thanks to the reviewer’s observation by adding a figure with the new behavior analysis to the supplementary figures and describing it in the results section of the behavior task. We now refer to these analyses where appropriate:
(intro) “The effect appeared as a human universal, present in preschoolers, first-graders, and adults without access to formal western math education (the Himba from Namibia), and thus seemingly independent of education and of the existence of linguistic labels for regular shapes.”
(behavior results) “Finally, to separate the effect of name availability and geometric features on behavior, we replicated our analysis after removing the square, rectangle, trapezoids, rhombus and parallelogram from our data (Fig. S5D). This left us with five shapes, and an RDM with 10 entries. When regressing it in a GLM with our two models, we find that both models are still significant predictors (p<.001). The effect size of the geometric feature model is greatly reduced, yet remained significantly higher than that of the neural network model (p<.001).”
(meg results) “This analysis yielded similar clusters when performed on a subset of shapes that do not have an obvious name in English, as was the case for the behavior analysis (CNN Encoding: 64.0ms to 172.0ms; then 192.0ms to 296.0ms; both p<.001: Geometric Features: 312.0ms to 364.0ms with p=.008).”
(discussion, end of behavior section) “Previously, we only found such a significant mixture of predictors in uneducated humans (whether French preschoolers or adults from the Himba community, mitigating the possible impact of explicit western education, linguistic labels, and statistics of the environment on geometric shape representation) (Sablé-Meyer et al., 2021).”
Perhaps the referee’s point can also be reversed: we provide a normative theory of geometric shape complexity which has the potential to explain why certain shapes have names: instead of seeing shape names as the cause of their simpler mental representation, we suggest that the converse could occur, i.e. the simpler shapes are the ones that are given names.
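The reduced-set analysis described above (five unnamed shapes, hence a 10-entry RDM regressed on two model RDMs) can be sketched as follows. This is a simplified illustration with simulated data, not the analysis code from the paper: the function names, the z-scoring step, and the ordinary least-squares fit are assumptions, and significance testing across participants is omitted.

```python
import numpy as np
from itertools import combinations

shapes = ["kite", "rightKite", "rustedHinge", "hinge", "random"]
pairs = list(combinations(range(len(shapes)), 2))   # 5 shapes -> 10 unordered pairs

def upper_triangle(rdm):
    """Vectorize the off-diagonal entries of a symmetric dissimilarity matrix."""
    return np.array([rdm[i, j] for i, j in pairs])

def zscore(v):
    return (v - v.mean()) / v.std()

def fit_two_model_glm(behavior_rdm, cnn_rdm, geom_rdm):
    """Regress behavioral dissimilarities on two model RDMs (plus an intercept).

    Returns the standardized betas for the CNN and geometric-feature predictors.
    """
    y = zscore(upper_triangle(behavior_rdm))
    X = np.column_stack([
        np.ones(len(pairs)),
        zscore(upper_triangle(cnn_rdm)),
        zscore(upper_triangle(geom_rdm)),
    ])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas[1], betas[2]

# Simulated RDMs, for illustration only (no real data).
rng = np.random.default_rng(1)

def random_rdm(n):
    a = rng.random((n, n))
    rdm = (a + a.T) / 2          # symmetric dissimilarities
    np.fill_diagonal(rdm, 0.0)   # zero self-dissimilarity
    return rdm

cnn_rdm = random_rdm(len(shapes))
geom_rdm = random_rdm(len(shapes))
behavior_rdm = 0.3 * cnn_rdm + 0.7 * geom_rdm   # synthetic mixture of both models

beta_cnn, beta_geom = fit_two_model_glm(behavior_rdm, cnn_rdm, geom_rdm)
```

With only 10 entries per RDM the estimates are necessarily noisy, which is consistent with the reduced effect size reported for the geometric feature model on this subset.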
(2) Task difficulty
On the other hand, there is the question of whether our effect is driven by task difficulty. First, we would like to point out that this concern could apply to the fMRI task, which asks for an explicit detection of deviants, but does not apply to the MEG experiment. In MEG, participants passively looked at sequences of shapes which, for a given block, comprised many instances of a fixed standard shape and rare deviants; even if they noticed deviants, they had no task related to them. Yet two independent findings validated the geometric features model: there was a large effect of geometric regularity on the MEG response to deviants, and the MEG dissimilarity matrix between standard shapes correlated with a model based on geometric features, better than with a model based on CNNs. While the response to rare deviants might perhaps be attributed to “difficulty” (assuming that, in spite of the absence of an explicit task, participants try to spot the deviants and find this self-imposed task more difficult in runs with less regular shapes), it seems very hard to explain the representational similarity analysis (RSA) findings based on difficulty. Indeed, what motivated us to use RSA in both fMRI and MEG was to stop relying on the response to deviants, and use solely the data from standard or “reference” shapes, and model their neural response with theory-derived regressors.
We have updated the manuscript in several places to make our view on these points clearer:
(experiment 4) “This design allowed us to study the neural mechanisms of the geometric regularity effect without confounding effects of task, task difficulty, or eye movements.”
(figure 4, legend) “(A) Task structure: participants passively watch a constant stream of geometric shapes, one per second (presentation time 800ms). The stimuli are presented in blocks of 30 identical shapes up to scaling and rotation, with 4 occasional deviant shapes. Participants do not have a task to perform besides fixating.”
Reviewer #2 (Public review):
Weakness:
Given that the primary takeaway from this study is that geometric shape information is found in the dorsal stream rather than the ventral stream, there is very little discussion of prior work in this area (for reviews, see Freud et al., 2016; Orban, 2011; Xu, 2018). Indeed, there is extensive evidence of shape processing in the dorsal pathway in human adults (Freud, Culham, et al., 2017; Konen & Kastner, 2008; Romei et al., 2011), children (Freud et al., 2019), patients (Freud, Ganel, et al., 2017), and monkeys (Janssen et al., 2008; Sereno & Maunsell, 1998; Van Dromme et al., 2016), as well as the similarity between models and dorsal shape representations (Ayzenberg & Behrmann, 2022; Han & Sereno, 2022).
We thank the reviewer for this opportunity to clarify our writing. We want to use this opportunity to highlight that our primary finding is not about whether the shapes of objects or animals (in general) are processed in the ventral versus the dorsal pathway, but rather about the much more restricted domain of geometric shapes such as squares and triangles. We propose that simple geometric shapes afford additional levels of mental representation that rely on their geometric features – on top of the typical visual processing. To the best of our knowledge, this point has not been made in the above papers.
Still, we agree that it is useful to better link our proposal to previous ones. We have updated the discussion section titled “Two Visual Pathways” to include more specific references to the literature that have reported visual object representations in the dorsal pathway. Following another reviewer’s observation, we have also updated our analysis to better demonstrate the overlap in activation evoked by math and by geometry in the IPS, as well as include a novel comparison with independently published results.
Overall, to address this point, we (i) show the overlap between our “geometry” contrast (shape > word+tools+houses) and our “math” contrast (number > words); (ii) display these ROIs side by side with ROIs found in previous work (Amalric and Dehaene, 2016); and (iii) test our “geometry” contrast (shape > word+tools+houses) in each of the math-related ROIs reported in that article, and find almost all of them to be significant in both populations; see Fig. S5.
Finally, within the ROIs identified with our geometry localizer, we also performed similarity analyses: for each region we extracted the betas of every voxel for every visual category, and estimated the distance (cross-validated Mahalanobis) between different visual categories. In both ventral ROIs, in both populations, numbers were closer to shapes than to the other visual categories including text and Chinese characters (all p<.001). In adults, this result also holds for the right ITG (p=.021) and the left IPS (p=.014) but not the right IPS (p=.17). In children, this result did not hold in these areas.
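For readers unfamiliar with the cross-validated Mahalanobis ("crossnobis") distance mentioned above, the following sketch shows one common way to compute it. It is illustrative only: the run structure, the noise covariance estimate, the normalization by voxel count, and the simulated patterns are all assumptions, not details taken from the manuscript.

```python
import numpy as np
from itertools import combinations

def crossnobis(patterns_a, patterns_b, noise_cov):
    """Cross-validated Mahalanobis distance between two conditions.

    patterns_a, patterns_b: runs x voxels beta estimates for each condition.
    noise_cov: voxels x voxels noise covariance (assumed to come from GLM residuals).
    Difference patterns from two different runs are multiplied, so independent
    noise averages out and the estimate is unbiased (it can be slightly negative).
    """
    n_runs, n_vox = patterns_a.shape
    precision = np.linalg.inv(noise_cov)
    diffs = patterns_a - patterns_b              # one difference pattern per run
    products = [diffs[i] @ precision @ diffs[j]
                for i, j in combinations(range(n_runs), 2)]
    return np.mean(products) / n_vox

# Simulated ROI: 4 runs, 20 voxels; "numbers" are built to resemble "shapes".
rng = np.random.default_rng(2)
n_runs, n_vox, noise_sd = 4, 20, 0.3
shape_pattern = rng.normal(size=n_vox)
number_pattern = shape_pattern + 0.2 * rng.normal(size=n_vox)
word_pattern = rng.normal(size=n_vox)

def simulate(pattern):
    """Add independent measurement noise to a true pattern for each run."""
    return pattern + noise_sd * rng.normal(size=(n_runs, n_vox))

numbers, shapes, words = (simulate(p) for p in
                          (number_pattern, shape_pattern, word_pattern))
noise_cov = noise_sd ** 2 * np.eye(n_vox)        # idealized, known noise covariance

d_num_shape = crossnobis(numbers, shapes, noise_cov)
d_num_word = crossnobis(numbers, words, noise_cov)
```

In this simulation the numbers-to-shapes distance comes out smaller than the numbers-to-words distance by construction; the statement in the response is the analogous comparison on real data, tested for significance across participants.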
Naturally, overlap in brain activation does not suffice to conclude that the same computational processes are involved. We have added an explicit caveat about this point. Indeed, throughout the article, we have been careful to frame our results in a way that is appropriate given our evidence, e.g. saying “Those areas are similar to those active during number perception, arithmetic, geometric sequences, and the processing of high-level math concepts” and “The IPS areas activated by geometric shapes overlap with those active during the comprehension of elementary as well as advanced mathematical concepts”. We have rephrased the possibly ambiguous “geometric shapes activated math- and number-related areas, particular the right aIPS.” into “geometric shapes activated areas independently found to be activated by math- and number-related tasks, in particular the right aIPS”.
Reviewer #3 (Public review):
Weakness:
Perhaps the manuscript could emphasize that the areas recruited by geometric figures but not objects are spatial, with reduced processing in visual areas. It also seems important to say that the images of real objects are interpreted as representations of 3D objects, as they activate the same visual areas as real objects. By contrast, the images of geometric forms are not interpreted as representations of real objects but rather perhaps as 2D abstractions.
This is an interesting possibility. Geometric shapes are likely to draw attention to spatial dimensions (e.g. length) and to do so in a 2D spatial frame of reference rather than the 3D representations evoked by most other objects or images. However, this possibility would require further work to be thoroughly evaluated, for instance by comparing usual 3D objects with rare instances of 2D ones (e.g. a sheet of paper, a sticker etc). In the absence of such a test, we refrained from further speculation on this point.
The authors use the term "symbolic." That use of that term could usefully be expanded here.
The reviewer is right in pointing out that “symbolic” should have been more clearly defined. We now added in the introduction:
(introduction) “[…] we sometimes refer to this model as “symbolic” because it relies on discrete, exact, rule-based features rather than continuous representations (Sablé-Meyer et al., 2022). In this representational format, geometric shapes are postulated to be represented by symbolic expressions in a “language-of-thought”, e.g. “a square is a four-sided figure with four equal sides and four right angles” or equivalently by a computer-like program for drawing them in a Logo-like language (Sablé-Meyer et al., 2022).”
Here, however, the present experiments do not directly probe this representational format. We have therefore simplified our wording and removed many of our uses of the word “symbolic” in favor of the more specific “geometric features”.
Pigeons have remarkable visual systems. According to my fallible memory, Herrnstein investigated visual categories in pigeons. They can recognize individual people from fragments of photos, among other feats. I believe pigeons failed at geometric figures and also at cartoon drawings of things they could recognize in photos. This suggests they did not interpret line drawings of objects as representations of objects.
The comparison of geometric abilities across species is an interesting line of research. In the discussion, we briefly mention several lines of research that indicate that non-human primates do not perceive geometric shapes in the same way as we do – but for space reasons, we are reluctant to expand this section to a broader review of other more distant species. The referee is right that there is evidence of pigeons being able to perceive an invariant abstract 3D geometric shape in spite of much variation in viewpoint (Peissig et al., 2019) – but there does not seem to be evidence that they attend to geometric regularities specifically (e.g. squares versus non-squares). Also, the referee’s point bears on the somewhat different issue of whether humans and other animals may recognize the object depicted by a symbolic drawing (e.g. a sketch of a tree). Again, humans seem to be vastly superior in this domain, and research on this topic is currently ongoing in the lab. However, the point that we are making in the present work is specifically about the neural correlates of the representation of simple geometric shapes which by design were not intended to be interpretable as representations of objects.
Categories are established in part by contrast categories; are quadrilaterals, triangles, and circles different categories?
We are not sure how to interpret the referee’s question, since it bears on the definition of “category” (Spontaneous? After training? With what criterion?). While we are not aware of data that can unambiguously answer the reviewer’s question, categorical perception in geometric shapes can be inferred from early work investigating pop-out effects in visual search, e.g. (Treisman and Gormican, 1988): curvature appears to generate strong pop-out effects, and therefore we would expect e.g. circles to indeed be a different category than, say, triangles. Similarly, right angles, as well as parallel lines, have been found to be perceived categorically (Dillon et al., 2019).
This suggests that indeed squares would be perceived as categorically different from triangles and circles. On the other hand, in our own previous work (Sablé-Meyer et al., 2021) we have found that the deviants that we generated from our quadrilaterals did not pop out from displays of reference quadrilaterals. Pop-out is probably not the proper criterion for defining what a “category” is, but this is the extent to which we can provide an answer to the reviewer’s question.
It would be instructive to investigate stimuli that are on a continuum from representational to geometric, e.g., table tops or cartons under various projections, or balls or buildings that are rectangular or triangular. Building parts, inside and out, like corners. Objects differ from geometric forms in many ways: 3D rather than 2D, more complicated shapes, and internal texture. The geometric figures used are flat, 2-D, but much geometry is 3-D (e.g. cubes) with similar abstract features.
We agree that there is a whole line of potential research here. We decided to start by focusing on the simplest set of geometric shapes that would give us enough variation in geometric regularity while being easy to match on other visual features. We agree with the reviewer that our results should hold both for more complex 2-D shapes, but also for 3-D shapes. Indeed, generative theories of shapes in higher dimensions following similar principles as ours have been devised (I. Biederman, 1987; Leyton, 2003). We now mention this in the discussion:
“Finally, this research should ultimately be extended to the representation of 3-dimensional geometric shapes, for which similar symbolic generative models have indeed been proposed (Irving Biederman, 1987; Leyton, 2003).”
The feature space of geometry is more than parallelism and symmetry; angles are important, for example. Listing and testing features would be fascinating. Similarly, looking at younger or preferably non-Western children, as Western children are exposed to shapes in play at early ages.
We agree with the reviewer on all points. While we do not list and test the different properties separately in this work, we would like to highlight that angles are part of our geometric feature model, which includes features of “right-angle” and “equal-angles” as suggested by the reviewer.
We also agree about the importance of testing populations with limited exposure to formal training with geometric shapes. This was in fact a core aspect of a previous article of ours which tests both preschoolers, and adults with no access to formal western education – though no non-Western children (Sablé-Meyer et al., 2021). It remains a challenge to perform brain-imaging studies in non-Western populations (although see Dehaene et al., 2010; Pegado et al., 2014).
What in human experience but not the experience of close primates would drive the abstraction of these geometric properties? It's easy to make a case for elaborate brain processes for recognizing and distinguishing things in the world, shared by many species, but the case for brain areas sensitive to processing geometric figures is harder. The fact that these areas are active in blind mathematicians and that they are parietal areas suggests that what is important is spatial far more than visual. Could these geometric figures and their abstract properties be connected in some way to behavior, perhaps with fabrication and construction as well as use? Or with other interactions with complex objects and environments where symmetry and parallelism (and angles and curvature--and weight and size) would be important? Manual dexterity and fabrication also distinguish humans from great apes (quantitatively, not qualitatively), and action drives both visual and spatial representations of objects and spaces in the brain. I certainly wouldn't expect the authors to add research to this already packed paper, but raising some of the conceptual issues would contribute to the significance of the paper.
We refrained from speculating about this point in the previous version of the article, but share some of the reviewers’ intuitions about the underlying drive for geometric abstraction. As described in (Dehaene, 2026; Sablé-Meyer et al., 2022), our hypothesis, which isn’t tested in the present article, is that the emergence of a pervasive ability to represent aspects of the world as compact expressions in a mental “language-of-thought” is what underlies many domains of specific human competence, including some listed by the reviewer (tool construction, scene understanding) and our domain of study here, geometric shapes.
Recommendations for the Authors:
Reviewer #1 (Recommendations for the authors):
Overall, I enjoyed reading this paper. It is clearly written and nicely showcases the amount of work that has gone into conducting all these experiments and analyzing the data in sophisticated ways. I also thought the figures were great, and I liked the level of organization in the GitHub repository and am looking forward to seeing the shared data on OpenNeuro. I have some specific questions I hope the authors can address.
(1) Behavior
- Looking at Figure 1, it seemed like most shapes are clustering together, whereas square, rectangle, and maybe rhombus and parallelogram are slightly more unique. I was wondering whether the authors could comment on the potential influence of linguistic labels. Is it possible that it is easier to discard the intruder when the shapes are readily nameable versus not?
This is an interesting observation, but the existence of names for shapes does not suffice to explain all of our findings; see our reply to the public comment.
(2) fMRI
- As mentioned in the public review, I was surprised that the authors went with an intruder task because I would imagine that performance depends on the specific combination of geometric shapes used within a trial. I assume it is much harder to find, for example, a "Right Hinge" embedded within "Hinge" stimuli than a "Right Hinge" amongst "Squares". In addition, the rotation and scaling of each individual item should affect regular shapes less than irregular shapes, creating visual dissimilarities that would presumably make the task harder. Can the authors comment on how we can be sure that the differences we pick up in the parietal areas are not related to task difficulty but are truly related to geometric shape regularities?
Again, please see our public review response for a larger discussion of the impact of task difficulty. There are two aspects to answering this question.
First, the task is not as the reviewer describes: the intruder task is to find a deviant shape within several slightly rotated and scaled versions of the regular shape it came from. During brain imaging, we did not ask participants to find an exemplar of one of our reference shapes amidst copies of another, but rather a deviant version of one shape against copies of its reference version. We only used this intruder task with all pairs of shapes to generate the behavioral RSA matrix.
Second, we agree that some of the fMRI effect may stem from task difficulty, and this motivated our use of RSA analysis in fMRI, and a passive MEG task. RSA results cannot be explained by task difficulty.
Overall, we have tried to make the limitations of the fMRI design, and the motivation for turning to passive presentation in MEG, clearer by stating the issues more clearly when we introduce experiment 4:
“The temporal resolution of fMRI does not allow to track the dynamic of mental representations over time. Furthermore, the previous fMRI experiment suffered from several limitations. First, we studied six quadrilaterals only, compared to 11 in our previous behavioral work. Second, we used an explicit intruder detection, which implies that the geometric regularity effect was correlated with task difficulty, and we cannot exclude that this factor alone explains some of the activations in figure 3C (although it is much less clear how task difficulty alone would explain the RSA results in figure 3D). Third, the long display duration, which was necessary for good task performance especially in children, afforded the possibility of eye movements, which were not monitored inside the 3T scanner and again could have affected the activations in figure 3C.”
- How far in the periphery were the stimuli presented? Was eye-tracking data collected for the intruder task? Similar to the point above, I would imagine that a harder trial would result in more eye movements to find the intruder, which could drive some of the differences observed here.
A 1-degree bar was added to Figure 3A, which faithfully illustrates how the stimuli were presented in fMRI. Eye-tracking data was not collected during fMRI. Although the participants were explicitly instructed to fixate at the center of the screen and avoid eye movements, we fully agree with the referee that we cannot exclude that eye movements were present, perhaps more so for more difficult displays, and would therefore have contributed to the observed fMRI activations in experiment 3 (figure 3C). We now mention this limitation explicitly at the end of experiment 3. However, crucially, this potential problem cannot apply to the MEG data. During the MEG task, the stimuli were presented one by one at the center of the screen, without any explicit task, thus avoiding issues of eye movements. We therefore consider the MEG geometrical regularity effect, which comes at a relatively early latency (starting at ~160 ms) and even in a passive task, to provide the strongest evidence of geometric coding, unaffected by potential eye movement artefacts.
- I was wondering whether the authors would consider showing some un-thresholded maps just to see how widespread the activation of the geometric shapes is across all of the cortex.
We share the uncorrected threshold maps in Fig. S3 for both adults and children in the category localizer, copied here as well. For the geometry task, most of the clusters identified are fairly big and survive cluster-corrected permutations; the uncorrected statistical maps look almost fully identical to the ones presented in Fig. 3 (p<.001 map).
- I'm missing some discussion on the role of early visual areas that goes beyond the RSA-CNN comparison. I would imagine that early visual areas are not only engaged due to top-down feedback (line 258) but may actually also encode some of the geometric features, such as parallel lines and symmetry. Is it feasible to look at early visual areas and examine what the similarity structure between different shapes looks like?
If early visual areas encoded the geometric features that we propose, then even early sensor-level RSA matrices should show a strong impact of geometric feature similarity, which is not what we find (figure 4D). We do, however, appreciate the referee’s request to examine more closely what this similarity structure looks like. We now provide a movie showing the significant correlation between neural activity and our two models (uncorrected participants); indeed, while the early occipital activity (around 110ms) is dominated by a significant correlation with the CNN model, there are also scattered significant sources associated with the symbolic model around these timepoints already.
To test this further, we used beamformers to reconstruct the source-localized activity in calcarine cortex and performed an RSA analysis across that ROI. We find that indeed the CNN model is strongly significant at t=110ms (t=3.43, df=18, p=.003) while the geometric feature model is not (t=1.04, df=18, p=.31), and the CNN is significantly above the geometric feature model (t=4.25, df=18, p<.001). However, this result is not very stable across time, and there are significant temporal clusters around these timepoints associated with each model, with no significant cluster for the CNN > geometric contrast (CNN: significant cluster from 88ms to 140ms, p<.001 in a permutation-based test with 10000 permutations; geometric features: significant cluster from 80ms to 104ms, p=.0475; no significant cluster on the difference between the two).
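The kind of group-level comparison reported in this reply (per-participant correlation of a neural dissimilarity matrix with each model, followed by t-tests across participants) can be sketched as below. This is only an illustration on synthetic data: the function names, the use of Pearson correlation, the choice of 11 shapes, and all numbers are assumptions, not the authors' pipeline.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_sample_t(values):
    """t statistic and degrees of freedom for H0: mean == 0."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(var / n), n - 1


def rsa_group_test(neural_rdms, model_a, model_b):
    """Correlate each participant's RDM with two model RDMs, then test across participants.

    All RDMs are vectorised upper triangles of equal length.
    """
    r_a = [pearson(rdm, model_a) for rdm in neural_rdms]
    r_b = [pearson(rdm, model_b) for rdm in neural_rdms]
    diff = [a - b for a, b in zip(r_a, r_b)]  # paired difference, model A minus model B
    return {
        "a": one_sample_t(r_a),
        "b": one_sample_t(r_b),
        "a_minus_b": one_sample_t(diff),
    }


# Synthetic example (assumed sizes): 19 participants give df = 18;
# 11 shapes give 55 shape pairs.
rng = random.Random(0)
n_pairs = 11 * 10 // 2
model_cnn = [rng.random() for _ in range(n_pairs)]
model_geom = [rng.random() for _ in range(n_pairs)]
# Each synthetic "neural" RDM is a noisy copy of the CNN model only.
neural = [[0.5 * c + rng.gauss(0, 0.5) for c in model_cnn] for _ in range(19)]
results = rsa_group_test(neural, model_cnn, model_geom)
```

In practice, correlations are often Fisher z-transformed before the t-test, and the time-resolved cluster statistics mentioned above would require a permutation scheme that this sketch does not include.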
(3) MEG
- Similar to the fMRI set, I am a little worried that task difficulty has an effect on the decoding results, as the oddball should pop out more in more geometric shapes, making it easier to detect and easier to decode. Can the authors comment on whether it would matter for the conclusions whether they are decoding varying task difficulty or differences in geometric regularity, or whether they think this can be considered similarly?
See above for an extensive discussion of the task difficulty effect. We point out that there is no task in the MEG data collection part. We have clarified the task design by updating our Fig. 4. Additionally, the fact that oddballs are perceived more or less easily as a function of their geometric regularity is, in part, exactly the point that we are making; in MEG, however, this holds even in the absence of a task of looking for them.
- The authors discuss that the inflated baseline/onset decoding/regression estimates may occur because the shapes are being repeated within a mini-block, which I think is unlikely given the long ISIs and the fact that the geometric features model is not >0 at onset. I think their second possible explanation, that this may have to do with smoothing, is very possible. In the text, it said that for the non-smoothed result, the CNN encoding correlates with the data from 60ms, which makes a lot more sense. I would like to encourage the authors to provide readers with the unsmoothed beta values instead of the 100-ms smoothed version in the main plot to preserve the reason they chose to use MEG - for high temporal resolution!
We fully agree with the reviewer and have accordingly updated the figures to show the unsmoothed data (see below). Indeed, there is now no significant CNN effect before ~60 ms (up to the accuracy of identifying onsets with our method).
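The effect of smoothing on apparent onset latency can be illustrated with a noise-free toy signal. The sketch below assumes a centred moving average and one sample per millisecond; whether the original smoothing used a centred window is not stated here, so the exact shift is an assumption.

```python
def moving_average(signal, window):
    """Centred moving average; edges use only the samples that are available."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def onset_index(signal, threshold=0.0):
    """Index of the first sample strictly above threshold, or None if there is none."""
    for i, v in enumerate(signal):
        if v > threshold:
            return i
    return None


# 1 sample = 1 ms; the time axis runs from -100 ms to +299 ms.
times = list(range(-100, 300))
true_effect = [1.0 if t >= 60 else 0.0 for t in times]  # effect truly starts at 60 ms
smoothed = moving_average(true_effect, window=101)      # roughly a 100 ms centred window

raw_onset = times[onset_index(true_effect)]
smoothed_onset = times[onset_index(smoothed)]
print(raw_onset, smoothed_onset)  # prints "60 10"
```

With a centred window, the smoothed trace becomes non-zero half a window before the true onset, which is why unsmoothed estimates give a more trustworthy latency.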
- In Figure 4C, I think it would be useful to either provide error bars or show variability across participants by plotting each participant's beta values. I think it would also be nice to plot the dissimilarity matrices based on the MEG data at select timepoints, just to see what the similarity structure is like.
Following the reviewer’s recommendation, we plot the timeseries with SEM as shaded area, and thicker lines for statistically significant clusters, and we provide the unsmoothed version in Fig. 4. As for the dissimilarity matrices at select timepoints, this has now been added to Fig. 4.
- To evaluate the source model reconstruction, I think the reader would need a little more detail on how it was done in the main text. How were the lead fields calculated? Which data was used to estimate the sources? How are the models correlated with the source data?
We have imported some of the details in the main text as follows (as well as expanding the methods section a little):
“To understand which brain areas generated these distinct patterns of activations, and probe whether they fit with our previous fMRI results, we performed a source reconstruction of our data. We projected the sensor activity onto each participant's cortical surfaces estimated from T1-images. The projection was performed using eLORETA and emptyroom recordings acquired on the same day to estimate noise covariance, with the default parameters of mne-bids-pipeline. Sources were spaced using a recursively subdivided octahedron (oct5). Group statistics were performed after alignment to fsaverage. We then replicated the RSA analysis […]”
- In addition to fitting the CNN, which is used here to model differences in early visual cortex, have the authors considered looking at their fMRI results and localizing early visual regions, extracting a similarity matrix, and correlating that with the MEG and/or comparing it with the CNN model?
We had ultimately decided against comparing the empirical similarity matrices from the MEG and fMRI experiments, first because the stimuli and tasks are different, and second because this would not be directly relevant to our goal, which is to evaluate whether a geometric-feature model accounts for the data. Thus, we systematically model empirical similarity matrices from fMRI and from MEG with our two models derived from different theories of shape perception in order to test predictions about their spatial and temporal dynamics. As for comparing the similarity matrix from early visual regions in fMRI with that predicted by the CNN model, this is effectively visible from our Fig. 3D where we perform searchlight RSA analysis and modeling with both the CNN and the geometric feature model; bilaterally, we find a correlation with the CNN model, although it sometimes overlaps with predictions from the geometric feature model as well. We now include a section explaining this reasoning in the appendix:
“Representational similarity analysis also offers a way to directly compare similarity matrices measured in MEG and fMRI, thus allowing for fusion of those two modalities and tentatively assigning a “time stamp” to distinct MRI clusters. However, we did not attempt such an analysis here for several reasons. First, distinct tasks and block structures were used in MEG and fMRI. Second, a smaller list of shapes was used in fMRI, as imposed by the slower modality of acquisition. Third, our study was designed as an attempt to sort out between two models of geometric shape recognition. We therefore focused all analyses on this goal, which could not have been achieved by direct MEG-fMRI fusion, but required correlation with independently obtained model predictions.”
Minor comments
- It's a little unclear from the abstract that there is children's data for fMRI only.
We have reworded the abstract to make this unambiguous.
- Figures 4a & b are missing y-labels.
We can see how our labels could be confused with (sub-)plot titles and have moved them to make the interpretation clearer.
- MEG: are the stimuli always shown in the same orientation and size?
They are not: each shape has a random orientation and scaling. In addition to a task example at the top of Fig. 4, we have now included a clearer mention of this in the main text when we introduce the task:
“shapes were presented serially, one at a time, with small random changes in rotation and scaling parameters, in miniblocks with a fixed quadrilateral shape and with rare intruders with the bottom right corner shifted by a fixed amount (Sablé-Meyer et al., 2021)”
- To me, the discussion section felt a little lengthy, and I wonder whether it would benefit from being a little more streamlined, focused, and targeted. I found that the structure was a little difficult to follow as it went from describing the result by modality (behavior, fMRI, MEG) back to discussing mostly aspects of the fMRI findings.
We have tried to re-organize and streamline the discussion following these comments.
Then, later on, I found that especially the section on "neurophysiological implementation of geometry" went beyond the focus of the data presented in the paper and was comparatively long and speculative.
We have reexamined the discussion, but the citation of papers emphasizing a representation of non-accidental geometric properties in non-human animals was requested by other commentators on our article; and indeed, we think that they are relevant in the context of our prior suggestion that the composition of geometric features might be a uniquely human feature – these papers suggest that individual features may not, and that it is therefore compositionality which might be special to the human brain. We have nevertheless shortened it.
Furthermore, we think that this section is important because symbolic models are often criticized for lack of a plausible neurophysiological implementation. It is therefore important to discuss whether and how the postulated symbolic geometric code could be realized in neural circuits. We have added this justification to the introduction of this section.
Reviewer #2 (Recommendations for the authors):
(1) If the authors want to specifically claim that their findings align with mathematical reasoning, they could at least show the overlap between the activation maps of the current study and those from prior work.
This was added to the fMRI results. See our answers to the public review.
(2) I wonder if the reason the authors only found aIPS in their first analysis (Figure 2) is because they are contrasting geometric shapes with figures that also have geometric properties. In other words, faces, objects, and houses also contain geometric shape information, and so the authors may have essentially contrasted out other areas that are sensitive to these features. One indication that this may be the case is that the geometric regularity effect and searchlight RSA (Figure 3) contains both anterior and posterior IPS regions (but crucially, little ventral activity). It might be interesting to discuss the implications of these differences.
Indeed, we cannot exclude that the few symmetries, perpendicularity and parallelism cues that can be present in faces, objects or houses were processed as such, perhaps within the ventral pathway, and that these representations would have been subtracted out. We emphasize that our subtraction isolates the geometrical features that are present in simple regular geometric shapes, over and above those that might exist in other categories. We have added this point to the discussion:
“[… ] For instance, faces possess a plane of quasi-symmetry, and so do many other man-made tools and houses. Thus, our subtraction isolated the geometrical features that are present in simple regular geometric shapes (e.g. parallels, right angles, equality of length) over and above those that might already exist, in a less pure form, in other categories.”
(3) I had a few questions regarding the MEG results.
a. I didn't quite understand the task. What is a regular or oddball shape in this context? It's not clear what is being decoded. Perhaps a small example of the MEG task in Figure 4 would help?
We now include an additional sub-figure in Fig. 4 to explain the paradigm. In brief: there is no explicit task, participants are simply asked to fixate. The shapes come in miniblocks of 30 identical reference shapes (up to rotation and scaling), among which some occasional deviant shapes randomly appear (created by moving the corner of the reference shape by some amount).
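The passive paradigm described in this reply can be sketched as a trial-list generator. Only the miniblock length of 30 and the presence of random rotation, scaling, and occasional deviants come from the text; the deviant probability, rotation range, scale range, and all names below are illustrative assumptions.

```python
import random


def make_miniblock(shape, n_trials=30, p_deviant=0.1, max_rotation_deg=25.0,
                   scale_range=(0.85, 1.15), seed=None):
    """Return a list of trial dicts for one miniblock built on a single reference shape.

    Every numeric default except n_trials is an assumption made for illustration.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        trials.append({
            "shape": shape,
            "deviant": rng.random() < p_deviant,  # occasional oddball with a shifted corner
            "rotation_deg": rng.uniform(-max_rotation_deg, max_rotation_deg),
            "scale": rng.uniform(*scale_range),
        })
    return trials


block = make_miniblock("square", seed=1)
n_deviants = sum(t["deviant"] for t in block)
```

Because every trial carries its own rotation and scale, any decoding of reference versus deviant cannot rely on a fixed retinal image of the shape.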
b. In Figure 4A/B they describe the correlation with a 'symbolic model'. Is this the same as the geometric model in 4C?
It is. We have removed this ambiguity by calling it “geometric model” and setting its color to the one associated with this model throughout the article.
c. The authors' explanation for why geometric feature coding was slower than CNN encoding doesn't quite make sense to me. As an explanation, they suggest that previous studies computed "elementary features of location or motor affordance", whereas their study examines "high-level mathematical information of an abstract nature." However, looking at the studies the authors cite in this section, it seems that these studies also examined the time course of shape processing in the dorsal pathway, not "elementary features of location or motor affordance." Second, it's not clear how the geometric feature model reflects high-level mathematical information (see point above about claiming this is related to math).
We thank the referee for pointing out this inappropriate phrase, which we removed. We rephrased the rest of the paragraph to clarify our hypothesis in the following way:
“However, in this work, we specifically probed the processing of geometric shapes that, if our hypothesis is correct, are represented as mental expressions that combine geometrical and arithmetic features of an abstract categorical nature, for instance representing “four equal sides” or “four right angles”. It seems logical that such expressions, combining number, angle and length information, take more time to be computed than the first wave of feedforward processing within the occipito-temporal visual pathway, and therefore only activate thereafter.”
One explanation may be that the authors' geometric shapes require finer-grained discrimination than the object categories used in prior studies. i.e., the odd-ball task may be more of a fine-grained visual discrimination task. Indeed, it may not be a surprise that one can decode the difference between, say, a hammer and a butterfly faster than two kinds of quadrilaterals.
We do not disagree with this intuition, although note that we do not have data on this point (we are reporting and modelling the MEG RSA matrix across geometric shapes only; in this part, no other shapes such as tools or faces are involved). Still, the difference between squares, rectangles, parallelograms and other geometric shapes in our stimuli is not so subtle. Furthermore, CNNs do make very fine-grained distinctions, for instance between many different breeds of dogs in the ImageNet corpus. Still, those sorts of distinctions capture the initial part of the MEG response, while the geometric model is needed only for the later part. Thus, we think that it is a genuine finding that geometric computations associated with the dorsal parietal pathway are slower than the image analysis performed by the ventral occipito-temporal pathway.
d. CNN encoding at time 0 is a little weird, but the authors' explanation, that this arises because the data were temporally smoothed using a 100 ms window, makes sense. However, smoothing by 100 ms is quite a lot, and it doesn't seem accurate to present continuous time course data when the decoding or RSA result at each time point reflects a 100 ms bin. It may be more accurate to simply show unsmoothed data. I'm less convinced by the explanation about shape prediction.
We agree. Following the reviewer’s advice, as well as the recommendation from reviewer 1, we now display unsmoothed plots, and the effects now exhibit a more reasonable timing (Figure 4D), with effects starting at ~60 ms for CNN encoding.
(4) I appreciate the authors' use of multiple models and their explanation for why DINOv2 explains more variance than the geometric and CNN models (that it represents both types of features). A variance partitioning analysis may help strengthen this conclusion (Bonner & Epstein, 2018; Lescroart et al., 2015).
However, one difference between DINOv2 and the CNN used here is that it is trained on a dataset of 142 million images vs. the 1.5 million images used in ImageNet. Thus, DINOv2 is more likely to have been exposed to simple geometric shapes during training, whereas standard ImageNet trained models are not. Indeed, prior work has shown that lesioning line drawing-like images from such datasets drastically impairs the performance of large models (Mayilvahanan et al., 2024). Thus, it is unlikely that the use of a transformer architecture explains the performance of DINOv2. The authors could include an ImageNet-trained transformer (e.g., ViT) and a CNN trained on large datasets (e.g., ResNet trained on the Open Clip dataset) to test these possibilities. However, I think it's also sufficient to discuss visual experience as a possible explanation for the CNN and DINOv2 results. Indeed, young children are exposed to geometric shapes, whereas ImageNet-trained CNNs are not.
We agree with the reviewer’s observation. In fact, new and ongoing work from the lab is also exploring this; we have included in supplementary materials exactly what the reviewer is suggesting, namely the time course of the correlation with ViT and with ConvNeXT. In line with the reviewer’s prediction, these networks, trained on a much larger dataset and with many more parameters, can also fit the human data as well as DINOv2. We ran additional analysis of the MEG data with ViT and ConvNeXT, which we now report in Fig. S6 as well as in an additional sentence in that section:
“[…] similar results were obtained by performing the same analysis, not only with another vision transformer network, ViT, but crucially using a much larger convolutional neural network, ConvNeXT, which comprises ~800M parameters and has been trained on 2B images, likely including many geometric shapes and human drawings. For the sake of completeness, RSA analysis in sensor space of the MEG data with these two models is provided in Fig. S6.”
We conclude that the size and nature of the training set could be as important as the architecture – but also note that humans do not rely on such a huge training set. We have updated the text, as well as Fig. S6, accordingly by updating the section now entitled “Vision Transformers and Larger Neural Networks”, and the discussion section on theoretical models.
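The variance partitioning analysis suggested by the reviewer in point (4) can be illustrated for the simplest case of two model predictors. The sketch below uses the closed-form R² for two-predictor linear regression to split explained variance into unique and shared parts; the function names and the toy data are assumptions, and this analysis was not part of the exchange above.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_partition(y, model_a, model_b):
    """Split the variance of y explained by two predictors into unique and shared parts.

    Assumes the two predictors are not perfectly correlated (|r_ab| < 1).
    """
    r_ya, r_yb, r_ab = pearson(y, model_a), pearson(y, model_b), pearson(model_a, model_b)
    r2_a, r2_b = r_ya ** 2, r_yb ** 2
    # R^2 of the regression of y on both predictors together.
    r2_full = (r2_a + r2_b - 2 * r_ya * r_yb * r_ab) / (1 - r_ab ** 2)
    return {
        "unique_a": r2_full - r2_b,
        "unique_b": r2_full - r2_a,
        "shared": r2_a + r2_b - r2_full,
        "full": r2_full,
    }


# Toy data: y is exactly the sum of two uncorrelated predictors.
a = [1.0, -1.0, 1.0, -1.0]
b = [1.0, 1.0, -1.0, -1.0]
y = [ai + bi for ai, bi in zip(a, b)]
parts = variance_partition(y, a, b)
```

In this toy case each predictor uniquely explains half of the variance and nothing is shared; with correlated model RDMs, the shared term would indicate how much of the fit cannot be attributed to either model alone.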
(5) The authors may be interested in a recent paper from Arcaro and colleagues that showed that the parietal cortex is greatly expanded in humans (including infants) compared to non-human primates (Meyer et al., 2025), which may explain the stronger geometric reasoning abilities of humans.
A very interesting article indeed! We have updated our article to incorporate this reference in the discussion, in the section on visual pathways, as follows:
“Finally, recent work shows that within the visual cortex, the strongest relative difference in growth between human and non-human primates is localized in parietal areas (Meyer et al., 2025). If this expansion reflected the acquisition of new processing abilities in these regions, it might explain the observed differences in geometric abilities between human and non-human primates (Sablé-Meyer et al., 2021).”
Also, the authors may want to include this paper, which uses a similar oddity task and compellingly shows that crows are sensitive to geometric regularity:
Schmidbauer, P., Hahn, M., & Nieder, A. (2025). Crows recognize geometric regularity. Science Advances, 11(15), eadt3718. https://doi.org/10.1126/sciadv.adt3718
We have ongoing discussions with the authors of this work and have prepared a response to their findings (Sablé-Meyer and Dehaene, 2025). Ultimately, we think that this discussion, which we agree is important, does not have its place in the present article. They used a reduced version of our design, with amplified differences in the intruders. While they did not test the fit of their data with CNN or geometric feature models, we did and found that a simple CNN suffices to account for crow behavior. Thus, we disagree that their conclusions follow from their results. But the present article does not seem to be the right platform to engage in this discussion.
References
Ayzenberg, V., & Behrmann, M. (2022). The Dorsal Visual Pathway Represents Object-Centered Spatial Relations for Object Recognition. The Journal of Neuroscience, 42(23), 4693-4710. https://doi.org/10.1523/jneurosci.2257-21.2022
Bonner, M. F., & Epstein, R. A. (2018). Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology, 14(4), e1006111. https://doi.org/10.1371/journal.pcbi.1006111
Bueti, D., & Walsh, V. (2009). The parietal cortex and the representation of time, space, number and other magnitudes. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1525), 1831-1840.
Dehaene, S., & Brannon, E. (2011). Space, time and number in the brain: Searching for the foundations of mathematical thought. Academic Press.
Freud, E., Culham, J. C., Plaut, D. C., & Behrmann, M. (2017). The large-scale organization of shape processing in the ventral and dorsal pathways. eLife, 6, e27576.
Freud, E., Ganel, T., Shelef, I., Hammer, M. D., Avidan, G., & Behrmann, M. (2017). Three-dimensional representations of objects in dorsal cortex are dissociable from those in ventral cortex. Cerebral Cortex, 27(1), 422-434.
Freud, E., Plaut, D. C., & Behrmann, M. (2016). 'What' is happening in the dorsal visual pathway. Trends in Cognitive Sciences, 20(10), 773-784.
Freud, E., Plaut, D. C., & Behrmann, M. (2019). Protracted developmental trajectory of shape processing along the two visual pathways. Journal of Cognitive Neuroscience, 31(10), 1589-1597.
Han, Z., & Sereno, A. (2022). Modeling the Ventral and Dorsal Cortical Visual Pathways Using Artificial Neural Networks. Neural Computation, 34(1), 138-171. https://doi.org/10.1162/neco_a_01456
Janssen, P., Srivastava, S., Ombelet, S., & Orban, G. A. (2008). Coding of shape and position in macaque lateral intraparietal area. Journal of Neuroscience, 28(26), 6679-6690.
Konen, C. S., & Kastner, S. (2008). Two hierarchically organized neural systems for object information in human visual cortex. Nature Neuroscience, 11(2), 224-231.
Lescroart, M. D., Stansbury, D. E., & Gallant, J. L. (2015). Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Frontiers in Computational Neuroscience, 9(135), 1-20. https://doi.org/10.3389/fncom.2015.00135
Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., & Brendel, W. (2024). In search of forgotten domain generalization. arXiv Preprint arXiv:2410.08258.
Meyer, E. E., Martynek, M., Kastner, S., Livingstone, M. S., & Arcaro, M. J. (2025). Expansion of a conserved architecture drives the evolution of the primate visual cortex. Proceedings of the National Academy of Sciences, 122(3), e2421585122. https://doi.org/10.1073/pnas.2421585122
Orban, G. A. (2011). The extraction of 3D shape in the visual system of human and nonhuman primates. Annual Review of Neuroscience, 34, 361-388.
Romei, V., Driver, J., Schyns, P. G., & Thut, G. (2011). Rhythmic TMS over Parietal Cortex Links Distinct Brain Frequencies to Global versus Local Visual Processing. Current Biology, 21(4), 334-337. https://doi.org/10.1016/j.cub.2011.01.035
Sereno, A. B., & Maunsell, J. H. R. (1998). Shape selectivity in primate lateral intraparietal cortex. Nature, 395(6701), 500-503. https://doi.org/10.1038/26752
Summerfield, C., Luyckx, F., & Sheahan, H. (2020). Structure learning and the posterior parietal cortex. Progress in Neurobiology, 184, 101717. https://doi.org/10.1016/j.pneurobio.2019.101717
Van Dromme, I. C., Premereur, E., Verhoef, B.-E., Vanduffel, W., & Janssen, P. (2016). Posterior Parietal Cortex Drives Inferotemporal Activations During Three-Dimensional Object Vision. PLoS Biology, 14(4), e1002445. https://doi.org/10.1371/journal.pbio.1002445
Xu, Y. (2018). A tale of two visual systems: Invariant and adaptive visual information representations in the primate brain. Annual Review of Vision Science, 4, 311-336.
Reviewer #3 (Recommendations for the authors):
Bring into the discussion some of the issues outlined above, especially a) the spatial rather than visual nature of the geometric figures and b) the non-representational aspects of geometric form.
We thank the reviewer for their recommendations – see our response to the public review for more details.
eLife Assessment
The authors present valuable empirical and modelling evidence that statistical learning in speech perception may contain sub-processes. While the evidence for statistical learning effects is solid, the link between the pattern of effects (both empirical and simulated) and the theoretical concepts of the sub-processes (e.g., segmentation, anticipation) could be further developed. This work is of broad interest to researchers working on, or with, statistical learning, and to any researcher interested in the challenges of how data and models adjudicate between competing theoretical constructs.
Reviewer #1 (Public review):
Summary:
This paper presents three experiments. Experiments 1 and 3 use a target detection paradigm to investigate the speed of statistical learning. The first experiment is a replication of Batterink, 2017, in which participants are presented with streams of uniform-length, trisyllabic nonsense words and asked to detect a target syllable. The results replicate previous findings, showing that learning (in the form of response time facilitation to later-occurring syllables within a nonsense word) occurs after a single exposure to a word. In the second experiment, participants are presented with streams of variable length nonsense words (two trisyllabic words and two disyllabic words), and perform the same task. A similar facilitation effect was observed as in Experiment 1. In Experiment 3 (newly added in the Revised manuscript), an adult version of the study by Johnson and Tyler is included. Participants were exposed to streams of words of either uniform length (all disyllabic) or mixed length (two disyllabic, two trisyllabic) and then asked to perform a familiarity judgment on a 1-5 scale on two words from the stream and two part-words. Performance was better in the uniform length condition.
The authors interpret these findings as evidence that target detection requires mechanisms different from segmentation. They present results of a computational model to simulate results from the target detection task, and find that a bigram model can produce facilitation effects similar to the ones observed by human participants in Experiments 1 and 2 (though this model was not directly applied to test whether human-like effects were also produced to account for the data in Experiment 3). PARSER was also tested and produced differing results from those observed by humans across all three experiments. The authors conclude that the mechanisms involved in the target detection task are different from those involved in the word segmentation task.
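As background for the bigram account mentioned above, a model of this kind tracks how predictable each syllable is from the one before it. The sketch below computes forward transitional probabilities over a short hypothetical stream; the syllables, word order, and function name are invented for illustration, and it is not a description of how the authors implemented their model.

```python
from collections import Counter


def transitional_probabilities(stream):
    """Forward transitional probability P(next | current) for each adjacent syllable pair."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])  # the final syllable has no successor
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical stream built from two trisyllabic nonsense words.
words = {"A": ["tu", "pi", "ro"], "B": ["go", "la", "bu"]}
order = ["A", "B", "B", "A", "A", "B", "A", "B"]
stream = [syllable for w in order for syllable in words[w]]
tp = transitional_probabilities(stream)
```

Within-word transitions such as ("tu", "pi") have probability 1.0 here, whereas transitions across word boundaries such as ("ro", "go") are lower (0.75 in this stream), which is the kind of asymmetry that could produce faster responses to later syllables within a word.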
Strengths:
The paper presents multiple experiments that provide internal replication of a key experimental finding, in which response times are facilitated after a single exposure to an embedded pseudoword. Both experimental data and results from a computational model are presented, providing converging approaches for understanding and interpreting the main results. The data are analyzed very thoroughly using mixed effects models with multiple explanatory factors. The addition of Experiment 3 provides direct evidence that the profile of performance for familiarity ratings and target detection differ as a function of word length variability.
Weaknesses:
(1) The concept of segmentation is still not quite clear. The authors seem to treat the testing procedure of Experiment 3 as synonymous with segmentation. But the ability to more strongly endorse words from the stream versus part-words as familiar does not necessarily mean that they have been successfully "segmented", as I elaborated on in my earlier review. In my view, it would be clearer to refer to segmentation as the mechanism or conceptual construct of segmenting continuous speech into discrete words. This ability to accurately segment component words could support familiarity judgments but is not necessary for above-chance familiarity or recognition judgments, which could be supported by more general memory signals. In other words, segmentation as an underlying ability is sufficient but not necessary for above-chance performance on familiarity-driven measures such as the one used in experiment 3.
(2) The addition of Experiment 3 is an added strength of the revised paper and provides more direct evidence of dissociations as a function of word length on the two tasks (target detection and familiarity ratings), compared to the prior strategy of just relying on previous work for this claim. However, it is not clear why the authors chose not to use the same stimuli as used in Experiments 1 and 2, which would have allowed for more direct comparisons to be made. It should also be specified whether test items in the UWL and MWL conditions were matched for overall frequency during exposure. Currently, the text does not specify whether test words in the UWL condition were taken from the high frequency or low frequency group; if they were taken from the high frequency group this would of course be a confound when comparing to the MWL condition. Finally, the definition of part-words should also be clarified.
(3) The framing and argument for a prediction/anticipation mechanism was dropped in the revised manuscript, but there are still a few instances where this framing and interpretation remain. E.g. Abstract - "we found that a prediction mechanism, rather than clustering, could explain the data from target detection." Discussion page 43 "Together, these results suggest that a simple prediction-based mechanism can explain the results from the target detection task, and clustering-based approaches such as PARSER cannot, contrary to previous claims."
Minor (4) It was a bit unclear as to why a conceptual replication of Batterink 2017 was conducted, given that the target syllables at the beginning and end of the streams were immediately dropped from further analysis. Why include syllable targets within these positions in the design if they are not analyzed?
(5) Figures 3 and 4 are plotted on different scales, which makes it difficult to visually compare the effects between word length conditions.
Reviewer #2 (Public review):
Summary:
This valuable study investigates how statistical learning may facilitate a target detection task and whether the facilitation effect is related to statistical learning of word boundaries. Solid evidence is provided that target detection and word segmentation rely on different statistical learning mechanisms.
Strengths:
The study is well designed, using the contrast between the learning of words of uniform length and words of variable length to dissociate general statistical learning effects and effects related to word segmentation.
Weaknesses:
The study relies on the contrast between word length effects on target detection and word learning. However, the study only tested the target detection condition and did not attempt to replicate the word segmentation effect. It is true that the word segmentation effect has been replicated before but it is still worth reviewing the effect size of previous studies.
The paper seems to distinguish prediction, anticipation, and statistical learning, but it is not entirely clear what each term refers to.
Comments on revisions:
The authors did not address my concerns...they only replied to reviewer 1.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
Summary:
This paper presents two experiments, both of which use a target detection paradigm to investigate the speed of statistical learning. The first experiment is a replication of Batterink, 2017, in which participants are presented with streams of uniform-length, trisyllabic nonsense words and asked to detect a target syllable. The results replicate previous findings, showing that learning (in the form of response time facilitation to later-occurring syllables within a nonsense word) occurs after a single exposure to a word. In the second experiment, participants are presented with streams of variable-length nonsense words (two trisyllabic words and two disyllabic words) and perform the same task. A similar facilitation effect was observed as in Experiment 1. The authors interpret these findings as evidence that target detection requires mechanisms different from segmentation. They present results of a computational model to simulate results from the target detection task and find that an "anticipation mechanism" can produce facilitation effects, without performing segmentation. The authors conclude that the mechanisms involved in the target detection task are different from those involved in the word segmentation task.
Strengths:
The paper presents multiple experiments that provide internal replication of a key experimental finding, in which response times are facilitated after a single exposure to an embedded pseudoword. Both experimental data and results from a computational model are presented, providing converging approaches for understanding and interpreting the main results. The data are analyzed very thoroughly using mixed effects models with multiple explanatory factors.
Weaknesses:
In my view, the main weaknesses of this study relate to the theoretical interpretation of the results.
(1) The key conclusion from these findings is that the facilitation effect observed in the target detection paradigm is driven by a different mechanism (or mechanisms) than those involved in word segmentation. The argument here I think is somewhat unclear and weak, for several reasons:
First, there appears to be some blurring in what exactly is meant by the term "segmentation" with some confusion between segmentation as a concept and segmentation as a paradigm.
Conceptually, segmentation refers to the segmenting of continuous speech into words. However, this conceptual understanding of segmentation (as a theoretical mechanism) is not necessarily what is directly measured by "traditional" studies of statistical learning, which typically (at least in adults) involve exposure to a continuous speech stream followed by a forced-choice recognition task of words versus recombined foil items (part-words or nonwords). To take the example provided by the authors, a participant presented with the sequence GHIABCDEFABCGHI may endorse ABC as being more familiar than BCG, because ABC is presented more frequently together and the learned association between A and B is stronger than between C and G. However, endorsement of ABC over BCG does not necessarily mean that the participant has "segmented" ABC from the speech stream, just as faster reaction times in responding to syllable C versus A do not necessarily indicate successful segmentation. As the authors argue on page 7, "an encounter to a sequence in which two elements co-occur (say, AB) would theoretically allow the learner to use the predictive relationship during a subsequent encounter (that A predicts B)." By the same logic, encoding the relationship between A and B could also allow for the above-chance endorsement of items that contain AB over items containing a weaker relationship.
Both recognition performance and facilitation through target detection reflect different outcomes of statistical learning. While they may reflect different aspects of the learning process and/or dissociable forms of memory, they may best be viewed as measures of statistical learning, rather than mechanisms in and of themselves.
Thanks for this nuanced discussion, and this is an important point that R2 also raised. We agree that segmentation can refer to both an experimental paradigm and a mechanism that accounts for learning in the experimental paradigm. In the experimental paradigm, participants are asked to identify which words they believe to be (whole) words from the continuous syllable stream. In the target-detection experimental paradigm, participants are not asked to identify words from continuous streams, and instead, they respond to the occurrences of a certain syllable. It’s possible that learners employ one mechanism in these two tasks, or that they employ separate mechanisms. It’s also the case that, if all we have is positive evidence for both experimental paradigms, i.e., learners can succeed in segmentation tasks as well as in target detection tasks with different types of sequences, we would have no way of talking about different mechanisms, as you correctly suggested that evidence for segmenting AB and processing B faster following A, is not evidence for different mechanisms.
However, that is not the case. When the syllable sequences contain same-length subsequences (i.e., words), learning is indeed successful in both segmentation and target detection tasks. However, in studies such as Hoch et al. (2013), findings suggest that words from mixed-length sequences are harder to segment than words from uniform-length sequences. This finding exists in adult work (e.g., Hoch et al. 2013) as well as infant work (Johnson & Tyler, 2010), and is replicated here in the newly included Experiment 3, which stands in contrast to the positive findings of the facilitation effect with mixed-length sequences in the target detection paradigm (one of our main findings in the paper). Thus, it seems to be difficult to explain, if the learning mechanisms were to be the same, why humans can succeed with mixed-length sequences in target detection (as shown in Experiment 2) but struggle with mixed-length sequences in segmentation (as shown in Hoch et al. and Experiment 3).
In our paper, we have clarified these points and describe the separate mechanisms in more detail, in both the Introduction and General Discussion sections.
(2) The key manipulation between experiments 1 and 2 is the length of the words in the syllable sequences, with words either constant in length (experiment 1) or mixed in length (experiment 2). The authors show that similar facilitation levels are observed across this manipulation in the current experiments. By contrast, they argue that previous findings have found that performance is impaired for mixed-length conditions compared to fixed-length conditions. Thus, a central aspect of the theoretical interpretation of the results rests on prior evidence suggesting that statistical learning is impaired in mixed-length conditions. However, it is not clear how strong this prior evidence is. There is only one published paper cited by the authors - the paper by Hoch and colleagues - that supports this conclusion in adults (other mentioned studies are all in infants, which use very different measures of learning). Other papers not cited by the authors do suggest that statistical learning can occur to stimuli of mixed lengths (Thiessen et al., 2005, using infant-directed speech; Frank et al., 2010 in adults). I think this theoretical argument would be much stronger if the dissociation between recognition and facilitation through RTs as a function of word length variability was demonstrated within the same experiment and ideally within the same group of participants.
To summarize the evidence of learning uniform-length and mixed-length sequences (which we discussed in the Introduction section), “even though infants and adults alike have shown success segmenting syllable sequences consisting of words that were uniform in length (i.e., all words were either disyllabic; Graf Estes et al., 2007; or trisyllabic, Aslin et al., 1998), both infants and adults have shown difficulty with syllable sequences consisting of words of mixed length (Johnson & Tyler, 2010; Johnson & Jusczyk, 2003a; 2003b; Hoch et al., 2013).” The newly added Experiment 3 also provided evidence for the difference in uniform-length and mixed-length sequences. Notably, we do not agree with the idea that infant work should be disregarded as evidence just because infants were tested with habituation methods; not only were the original findings (Saffran et al. 1996) based on infant work, so were many other studies on statistical learning.
There are other segmentation studies in the literature that have used mixed-length sequences, which are worth discussing. In short, these studies differ from the Saffran et al. (1996) studies in many important ways, and in our view, these differences explain why the learning was successful. Of interest, Thiessen et al. (2005) that you mentioned was based on infant work with infant methods, and demonstrated the very point we argued for: In their study, infants failed to learn when mixed-length sequences were pronounced as adult-directed speech, and succeeded in learning given infant-directed speech, which contained prosodic cues that were much more pronounced. The fact that infants failed to segment mixed-length sequences without certain prosodic cues is consistent with our claim that mixed-length sequences are difficult to segment in a segmentation paradigm. Another such study is Frank et al. (2010), where continuous sequences were presented in “sentences”. Different numbers of words were concatenated into sentences where a 500ms break was present between each sentence in the training sequence. One sentence contained only one word, or two words, and in the longest sentence, there were 24 words. The results showed that participants are sensitive to the effect of sentence boundaries, which coincide with word boundaries. In the extreme, the one-word-per-sentence condition simply presents learners with segmented word forms. In the 24-word-per-sentence condition, there are nevertheless sentence boundaries that are word boundaries, and knowing these word boundaries alone should allow learners to perform above chance in the test phase. Thus, in our view, this demonstrates that learners can use sentence boundaries to infer word boundaries, which is an interesting finding in its own right, but this does not show that a continuous syllable sequence with mixed word lengths is learnable without additional information. 
In summary, to our knowledge, syllable sequences containing mixed word lengths are better learned when additional cues to word boundaries are present, and there is strong evidence that syllable sequences containing uniform-word lengths are learned better than mixed-length ones.
Frank, M. C., Goldwater, S., Griffiths, T. L., & Tenenbaum, J. B. (2010). Modeling human performance in statistical word segmentation. Cognition, 117(2), 107-125.
To address your proposal of running more experiments to provide stronger evidence for our theory, we were planning to run another study in which the same group of participants would complete both the segmentation and target detection paradigms as suggested, but we were unable to do so because we encountered difficulties in testing English-speaking participants. Instead, we have included a previously unpublished experiment (now Experiment 3), showing the difference between the learning of uniform-length and mixed-length sequences with the segmentation paradigm. This experiment provides further evidence for adults’ difficulties in segmenting mixed-length sequences.
(3) The authors argue for an "anticipation" mechanism in explaining the facilitation effect observed in the experiments. The term anticipation would generally be understood to imply some kind of active prediction process, related to generating the representation of an upcoming stimulus prior to its occurrence. However, the computational model proposed by the authors (page 24) does not encode anything related to anticipation per se. While it demonstrates facilitation based on prior occurrences of a stimulus, that facilitation does not necessarily depend on active anticipation of the stimulus. It is not clear that it is necessary to invoke the concept of anticipation to explain the results, or indeed that there is any evidence in the current study for anticipation, as opposed to just general facilitation due to associative learning.
Thanks for raising this point. Indeed, the anticipation effect we described is indistinguishable from the facilitation effect observed in the reported experiments. We have dropped this framing.
In addition, related to the model, given that only bigrams are stored in the model, could the authors clarify how the model is able to account for the additional facilitation at the 3rd position of a trigram compared to the 2nd position?
Thanks for the question. We believe it is an empirical question whether there is an additional facilitation at the 3rd position of a trigram compared to the 2nd position. To investigate this issue, we conducted the following analysis with data from Experiment 1. First, we combined the data from two conditions (exact/conceptual) from Experiment 1 so as to have better statistical power. Next, we ran a mixed effect regression with data from syllable positions 2 and 3 only (i.e., data from syllable position 1 were not included). The fixed effect included the two-way interaction between syllable position and presentation, as well as stream position, and the random effect was a by-subject random intercept and stream position as the random slope. This interaction was significant (χ<sup>2</sup>(3) =11.73, p=0.008), suggesting that there is additional facilitation to the 3rd position compared to the 2nd position.
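As a quick sanity check on the statistic quoted above, the p-value for a χ² statistic with 3 degrees of freedom can be computed in closed form. The sketch below is illustrative only and is not the authors' analysis code: the function name `chi2_sf_df3` is hypothetical, and it assumes the reported value is a standard χ² test with df = 3.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with 3 df.

    Closed form for df = 3:
        sf(x) = erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    """
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Statistic quoted in the response for the position-by-presentation interaction.
p_value = chi2_sf_df3(11.73)
print(f"chi2(3) = 11.73 -> p = {p_value:.3f}")
```

Under these assumptions the result should land close to the reported p = 0.008; the same helper can be applied to the other χ²(3) values quoted in this response.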
For the model, here is an explanation of why the model assumes an additional facilitation to the 3rd position. In our model, we proposed a simple recursive relation between the RT of a syllable occurring for the nth time and the n+1<sup>th</sup> time, which is:
and
RT(1) = RT0 + stream_pos * stream_inc, where RT(n) is the RT for the n<sup>th</sup> presentation of the target syllable, stream_pos is the position (3-46) in the stream, and occurrence is the number of times the syllable has occurred so far in the stream.
What this means is that the model basically provides an RT value for every syllable in the stream. Thus, for a target at syllable position 1, there is an RT value as an unpredictable target, and for targets at syllable position 2, there is a facilitation effect. For targets at syllable position 3, the RT is facilitated by the same amount. As such, there is an additional facilitation effect for syllable position 3 because the effects of prediction are recursive.
(4) In the discussion of transitional probabilities (page 31), the authors suggest that "a single exposure does provide information about the transitions within the single exposure, and the probability of B given A can indeed be calculated from a single occurrence of AB." Although this may be technically true in that a calculation for a single exposure is possible from this formula, it is not consistent with the conceptual framework for calculating transitional probabilities, as first introduced by Saffran and colleagues. For example, Saffran et al. (1996, Science) describe that "over a corpus of speech there are measurable statistical regularities that distinguish recurring sound sequences that comprise words from the more accidental sound sequences that occur across word boundaries. Within a language, the transitional probability from one sound to the next will generally be highest when the two sounds follow one another within a word, whereas transitional probabilities spanning a word boundary will be relatively low." This makes it clear that the computation of transitional probabilities (i.e., Y | X) is conceptualized to reflect the frequency of XY / frequency of X, over a given language inventory, not just a single pair. Phrased another way, a single exposure to pair AB would not provide a reliable estimate of the raw frequencies with which A and AB occur across a given sample of language.
Thanks for the discussion. We understand your argument, but we respectfully disagree that computing transitional probabilities must be conducted under a certain theoretical framework. In our humble opinion, computing transitional probabilities is a mathematical operation, and as such, it is possible to do so with the least amount of data that enables the mathematical operation, which concretely is a single exposure during learning. While it is true that a single exposure may not provide a reliable estimate of frequencies or probabilities, it does provide information with which the learner can make decisions.
This is particularly true for topics under discussion regarding the minimal amount of exposure that can enable learning. It is important to distinguish the following two questions: whether learners can learn from a short exposure period (from a single exposure, in fact), and how long an exposure period the learner requires to produce a reliable estimate of frequencies. Incidentally, given the fact that learners can learn from a single exposure based on Batterink (2017) and the current study, it does not appear that learners require a long exposure period to learn about transitional probabilities.
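The computation at issue in this exchange, frequency of XY divided by frequency of X, can be made concrete with a short sketch. It is an illustration rather than the authors' model: the function name is hypothetical, and it assumes that the denominator counts only occurrences of X that are followed by another element. It is applied both to a single exposure to AB and to the example stream GHIABCDEFABCGHI used earlier in this review.

```python
from collections import Counter


def transitional_probabilities(stream):
    """Return {(x, y): count(xy) / count(x)} for adjacent elements of `stream`.

    count(x) only includes occurrences of x that have a successor, so the
    probabilities leaving each x sum to 1 (an assumption of this sketch).
    """
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}


# A single exposure to the pair AB is enough for the formula to return a value.
single = transitional_probabilities("AB")

# The example stream from the review: within-word AB versus boundary-spanning CG.
stream_tps = transitional_probabilities("GHIABCDEFABCGHI")

print(single[("A", "B")], stream_tps[("A", "B")], stream_tps[("C", "G")])
```

The sketch shows both sides of the disagreement: the formula is defined after one exposure, but a value computed from one pair says nothing about how stable that estimate would be over a larger sample.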
(5) In experiment 2, the authors argue that there is robust facilitation for trisyllabic and disyllabic words alike. I am not sure about the strength of the evidence for this claim, as it appears that there are some conflicting results relevant to this conclusion. Notably, in the regression model for disyllabic words, the omnibus interaction between word presentation and syllable position did not reach significance (p= 0.089). At face value, this result indicates that there was no significant facilitation for disyllabic words. The additional pairwise comparisons are thus not justified given the lack of omnibus interaction. The finding that there is no significant interaction between word presentation, word position, and word length is taken to support the idea that there is no difference between the two types of words, but could also be due to a lack of power, especially given the p-value (p = 0.010).
Thanks for the comment. First, we believe there is a typo in the last sentence of your comment, and that you were referring to the p-value of 0.103 (source: “The interaction was not significant (χ<sup>2</sup>(3) = 6.19, p = 0.103)”). Yes, a null result with a frequentist approach cannot support a null claim, but Bayesian analyses could potentially provide evidence for the null.
To this end, we conducted a Bayes factor analysis using the approach outlined in Harms and Lakens (2018), which generates a Bayes factor by computing a Bayesian information criterion for a null model and an alternative model. The alternative model contained a three-way interaction of word length, word presentation, and word position, whereas the null model contained a two-way interaction between word presentation and word position as well as a main effect of word length. Thus, the two models only differ in terms of whether there is a three-way interaction. The Bayes factor is then computed as exp[(BICalt − BICnull)/2]. This analysis showed that there is strong evidence for the null, where the Bayes factor was found to be exp(25.65), which is more than 10<sup>11</sup>. Thus, there is no power issue here, and there is strong evidence for the null claim that word length did not interact with other factors in Experiment 2.
There is another issue that you mentioned, of whether we should conduct pairwise comparisons if the omnibus interaction did not reach significance. This would be true given the original analysis plan, but we believe that a revised analysis plan makes more sense. In the revised analysis plan for Experiment 2, we start with the three-way interaction (as described in the previous paragraph). The three-way interaction was not significant, and after dropping the three-way interaction term, the two-way interaction and the main effect of word length are both significant, and we use this as the overall model. Testing the significance of the omnibus interaction between presentation and syllable position, we found that it was significant (χ<sup>2</sup>(3) =49.77, p<0.001). This shows, within a single model, that the interaction between presentation and syllable position is significant using data from both disyllabic and trisyllabic words. This was in addition to a significant fixed effect of word length (β=0.018, z=6.19, p<0.001). This should motivate the rest of the planned analysis, which concerns pairwise comparisons in the different word length conditions.
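The BIC-based Bayes factor described above reduces to one line of arithmetic, which the following sketch makes explicit. The function name and the two BIC values are hypothetical; the values were chosen only so that their difference is 51.30, which gives the exponent of 25.65 quoted in the response. The actual BICs are not reported in this text.

```python
import math


def bic_bayes_factor_null(bic_alt: float, bic_null: float) -> float:
    """Approximate Bayes factor in favour of the null model.

    Implements exp((BIC_alt - BIC_null) / 2), so values above 1 favour the null.
    """
    return math.exp((bic_alt - bic_null) / 2)


# Hypothetical BIC values; only their difference (51.30) matters here.
bf01 = bic_bayes_factor_null(bic_alt=1051.30, bic_null=1000.00)
print(f"BF01 = {bf01:.3e}")
```

Because BIC penalizes each extra parameter by the log of the sample size, an alternative model whose extra interaction terms barely improve the fit ends up with a larger BIC, and the Bayes factor then favours the null.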
(6) The results plotted in Figure 2 seem to suggest that RTs to the first syllable of a trisyllabic item slow down with additional word presentations, while RTs to the final position speed up. If anything, in this figure, the magnitude of the effect seems to be greater for 1st syllable positions (e.g., the RT difference between presentation 1 and 4 for syllable position 1 seems to be numerically larger than for syllable position 3, Figure 2D). Thus, it was quite surprising to see in the results (p. 16) that RTs for syllable position 1 were not significantly different for presentation 1 vs. the later presentations (but that they were significant for positions 2 and 3 given the same comparison). Is this possibly a power issue? Would there be a significant slowdown to 1st syllables if results from both the exact replication and conceptual replication conditions were combined in the same analysis?
Thanks for the suggestion and your careful visual inspection of the data. After combining the data, the slowdown to 1st syllables is indeed significant. We have reported this in the results of Experiment 1 (with an acknowledgement to this review):
Results showed that later presentations took significantly longer to respond to compared to the first presentation (χ<sup>2</sup>(3) = 10.70, p=0.014), where the effect grew larger with each presentation (second presentation: β=0.011, z=1.82, p=0.069; third presentation: β=0.019, z=2.40, p=0.016; fourth presentation: β=0.034, z=3.23, p=0.001).
(7) It is difficult to evaluate the description of the PARSER simulation on page 36. Perhaps this simulation should be introduced earlier in the methods and results rather than in the discussion only.
Thanks for the suggestions. We have added two separate simulations in the paper, which should describe the PARSER simulations sufficiently, as well as provide further information on the correspondence between the simulations and the experiments. Thanks again for the great review! We believe our paper has improved significantly as a result.
eLife Assessment
This study presents an important finding that ant nest structure and digging behavior depend on ant age demographics for a ground-dwelling ant species (Camponotus fellah). By asking whether ants employ age-polyethism in excavation, the authors address a long-standing question about how individuals in collectives determine the overall state of the task they must perform. The experimental evidence that the age of the ants and the group composition affect the digging of tunnels is convincing, and their model is able to replicate the colony's excavation dynamics qualitatively, results that may prove to be a key consideration for interpreting results from other studies in the field of social insect behavior.
Reviewer #1 (Public review):
This study investigates how ant group demographics influence nest structures and group behaviors of Camponotus fellah ants, a ground-dwelling carpenter ant species (found locally in Israel) that builds subterranean nest structures. Using a quasi-2D cell filled with artificial sand, the authors perform two complementary sets of experiments to try to link group behavior and nest structure: first, the authors place a mated queen and several pupae into their cell and observe the structures that emerge both before and after the pupae eclose (i.e., "colony maturation" experiments); second, the authors create small groups (of 5, 10, or 15 ants, each including a queen) within a narrow age range (i.e., "fixed demographic" experiments) to explore the dependence of age on construction. Some of the fixed demographic instantiations included a manually induced catastrophic collapse event; the authors then compared emergency repair behavior to natural nest creation. Finally, the authors introduce a modified logistic growth model to describe the time-dependent nest area. The modification introduced parameters that allow for age-dependent behavior, and the authors use their fixed demographic experiments to set these parameters, and then apply the model to interpret the behavior of the colony maturation experiments. The main results of this paper are that for natural nest construction, nest areas and morphologies depend on the age demographics of ants in the experiments: younger ants create larger nests and angled tunnels, while older ants tend to dig less and build predominantly vertical tunnels; in contrast, emergency response seems to elicit digging in ants of all ages to repair the nest.
The experimental results are convincing, providing new information and important insights into nest and colony growth in a social insect species. A model, inspired by previous work but modified to capture experimental results, is in reasonable agreement with experiments and is more biologically relevant than previous models.
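For readers unfamiliar with the baseline that the authors modify, the sketch below implements plain textbook logistic growth of nest area over time. It is not the authors' model: their age-dependent modification is not specified in this review and is deliberately not reproduced here. The function name and all parameter values are hypothetical and in arbitrary units.

```python
import math


def logistic_area(t: float, a0: float, k: float, r: float) -> float:
    """Closed-form logistic growth: area at time t, starting at a0, saturating at k.

    Solves dA/dt = r * A * (1 - A / k) with A(0) = a0.
    """
    if a0 <= 0 or k <= 0:
        raise ValueError("a0 and k must be positive")
    return k / (1 + ((k - a0) / a0) * math.exp(-r * t))


# Hypothetical parameters, for illustration only.
for day in (0, 5, 10, 20):
    print(day, round(logistic_area(day, a0=1.0, k=50.0, r=0.5), 2))
```

In this unmodified form the area rises quickly at first and then levels off at the saturation value `k`; the review indicates that the authors' version adds parameters so that the dynamics depend on ant age.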
Reviewer #2 (Public review):
I enjoyed this paper and its examination of the relationship between overall density and age polyethism to reduce the computational complexity required to match nest size with population. I had some questions about the requirement that growth is infinite in such a solution, but these have been addressed by the authors in the responses and updated manuscript. I also enjoyed the discussion of whether collective behaviour is an appropriate framework in systems in which agents (or individuals) differ in the behavioural rules they employ, according to age, location, or information state. This is especially important in a system like social insects, typically held as a classic example of individual-as-subservient to whole, and therefore most likely to employ universal rules of behaviour. The current paper demonstrates a potentially continuous age-related change in target behaviour (excavation), and suggests an elegant and minimal solution to the requirement for building according to need in ants, avoiding the invocation of potentially complex cognitive mechanisms, or information states that all individuals must have access to in order to have an adaptive excavation output.
The authors have addressed questions I had in the review process and the manuscript is now clear in its communication and conclusions.
The modelling approach is compelling, also allowing extrapolation to other group sizes and even other species. This to me is the main strength of the paper, as the answer to the question of whether it is younger or older ants that primarily excavate nests could have been answered by an individual tracking approach (albeit there are practical limitations to this, especially in the observation nest setup, as the authors point out). The analysis of the tunnel structure is also an important piece of the puzzle, and I really like the overall study.
Author response:
The following is the authors’ response to the previous reviews.
Reviewer #1 (Public review):
This study investigates how ant group demographics influence nest structures and group behaviors of Camponotus fellah ants, a ground-dwelling carpenter ant species (found locally in Israel) that builds subterranean nest structures. Using a quasi-2D cell filled with artificial sand, the authors perform two complementary sets of experiments to try to link group behavior and nest structure: first, the authors place a mated queen and several pupae into their cell and observe the structures that emerge both before and after the pupae eclose (i.e., "colony maturation" experiments); second, the authors create small groups (of 5, 10, or 15 ants, each including a queen) within a narrow age range (i.e., "fixed demographic" experiments) to explore the dependence of age on construction. Some of the fixed demographic instantiations included a manually induced catastrophic collapse event; the authors then compared emergency repair behavior to natural nest creation. Finally, the authors introduce a modified logistic growth model to describe the time-dependent nest area. The modification introduced parameters that allow for age-dependent behavior, and the authors use their fixed demographic experiments to set these parameters, and then apply the model to interpret the behavior of the colony maturation experiments. The main results of this paper are that for natural nest construction, nest areas and morphologies depend on the age demographics of ants in the experiments: younger ants create larger nests and angled tunnels, while older ants tend to dig less and build predominantly vertical tunnels; in contrast, emergency response seems to elicit digging in ants of all ages to repair the nest.
The experimental results are solid, providing new information and important insights into nest and colony growth in a social insect species. As presented, I still have some reservations about the model's contribution to a deeper understanding of the system. Additional context and explanation of the model, implications, and limitations would be helpful for readers.
We sincerely thank Reviewer #1 for the time and effort dedicated to our manuscript's detailed review and assessment. The new revision suggestions were constructive, and we have provided a point-by-point response to address them.
Reviewer #2 (Public review):
I enjoyed this paper and its examination of the relationship between overall density and age polyethism to reduce the computational complexity required to match nest size with population. I had some questions about the requirement that growth is infinite in such a solution, but these have been addressed by the authors in the responses and the updated manuscript. I also enjoyed the discussion of whether collective behaviour is an appropriate framework in systems in which agents (or individuals) differ in the behavioural rules they employ, according to age, location, or information state. This is especially important in a system like social insects, typically held as a classic example of individual-as-subservient to whole, and therefore most likely to employ universal rules of behaviour. The current paper demonstrates a potentially continuous age-related change in target behaviour (excavation), and suggests an elegant and minimal solution to the requirement for building according to need in ants, avoiding the invocation of potentially complex cognitive mechanisms, or information states that all individuals must have access to in order to have an adaptive excavation output.
The authors have addressed questions I had in the review process and the manuscript is now clear in its communication and conclusions.
The modelling approach is compelling, also allowing extrapolation to other group sizes and even other species. This to me is the main strength of the paper, as the answer to the question of whether it is younger or older ants that primarily excavate nests could have been answered by an individual tracking approach (albeit there are practical limitations to this, especially in the observation nest setup, as the authors point out). The analysis of the tunnel structure is also an important piece of the puzzle, and I really like the overall study.
We sincerely thank Reviewer #2 for the time and effort dedicated to our manuscript's detailed review and assessment.
Reviewer #1 (Recommendations for the authors):
Thank you for the modifications. I found much of the additional information very helpful. I do still have a few comments, which I will include below.
We thank the reviewer for this comment.
The authors provide some additional citations for the model, however, the ODE in refs 24 and 30 is different from what the authors present here, and different from what is presented in ref 29. Specifically, the additional "volume" term that multiplies the entire equation. Can the authors provide some additional context for their model in comparison to these models as well as how their model relates to other work?
We thank the reviewer for this question. The primary difference between the logistic model (reference number: 24,30), and the saturation model (reference number: 29) is rooted in their assumptions on the scaling of the active number of ants that participate in the nest excavation and the nest volume.
The logistic growth model (𝑑𝑉/𝑑𝑡 = α𝑉(1-V/Vs)) describes the excavation in fixed-sized colonies (50, 100, 200) through a balance of two key processes: (1) positive feedback (α𝑉), where the digging efficiency increases with the nest size, and (2) negative feedback (1-V/Vs), where growth slows as the nest approaches a saturation volume (Vs). The model assumes that the number of actively excavating ants is linearly proportional to the nest volume (V). This represents a scenario where a large nest contains or can support more workers, which in turn increases the digging rates. While this does not require explicit communication between individuals, ants indirectly sense the global nest volume through stigmergic cues, such as pheromone depositions and encounter rates. The model ignores individual differences in age.
In contrast, the saturation model (𝑑𝑉/𝑑𝑡 = α(1-V/Vs)) assumes that a constant number of ants is working throughout the excavation. The digging rate is therefore independent of the nest volume, slowing only due to the saturation term (1-V/Vs) as the nest approaches its target size. However, this model imposes a different cognitive requirement: ants must somehow assess the global nest volume (V) and the overall number of ants in the nest. Thus, rather than relying on local cues, ants need more explicit communication or a sophisticated global perception mechanism that allows them to sense the nest volume and the nest population and adjust the digging rates accordingly. Therefore, this model requires a more complex and less biologically plausible mechanism than the logistic model.
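The contrast between the two published models can be illustrated numerically. The sketch below is only an illustration and is not taken from the cited references: it assumes that the saturation model is the logistic form without the V prefactor (consistent with a digging rate that does not depend on nest volume), integrates both equations with a simple forward Euler scheme, and uses arbitrary, hypothetical values for α, Vs, and the initial volume.

```python
def logistic_rate(v, alpha, v_s):
    """Logistic model: dV/dt = alpha * V * (1 - V/Vs)."""
    return alpha * v * (1.0 - v / v_s)


def saturation_rate(v, alpha, v_s):
    """Assumed saturation model: dV/dt = alpha * (1 - V/Vs), no V prefactor."""
    return alpha * (1.0 - v / v_s)


def integrate(rate, v0, alpha, v_s, dt=0.01, steps=2000):
    """Forward Euler integration; returns the full volume trajectory."""
    v = v0
    trajectory = [v]
    for _ in range(steps):
        v += dt * rate(v, alpha, v_s)
        trajectory.append(v)
    return trajectory


# Hypothetical parameter values chosen only so that both curves saturate.
logistic = integrate(logistic_rate, v0=1.0, alpha=1.0, v_s=100.0)
saturation = integrate(saturation_rate, v0=1.0, alpha=50.0, v_s=100.0)
```

Under these assumptions both trajectories approach Vs, but the logistic curve starts slowly and accelerates as the nest grows, whereas the saturation curve starts at its maximum rate and only decelerates.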
In our age-dependent digging model in the manuscript, we explicitly sum the contribution of each ant towards the nest area expansion based on its age-dependent digging threshold (quantified from fixed demographics experiments). Thus, the term ‘V’ in ‘𝑉(1-V/Vs)’ has the same effect as the sum over all ants in equation (2) of our manuscript; both describe how the total excavation rate scales with the number of individuals. Under the simplifying assumptions that the number of ants is proportional to the nest volume ‘V’ and that all ants dig at a constant rate, equation (2) in the manuscript reduces to the logistic equation ‘𝑉(1-V/Vs)’. This implies that each ant individually assesses the nest volume and then digs at a rate ‘(1-V/Vs)’.
Thus, we adopted the simpler of the previously published models, in which ants individually react to local density cues and regulate their digging. This approach does not require a global assessment of the nest volume or the number of ants; a local perception of density triggers each ant’s decision to dig, likely modulated by the frequency of social contacts or chemical concentration, which serves as an indicator of the global nest area. The ant compares this locally perceived density to an innate, age-specific threshold. If the perceived local density exceeds its threshold (indicating insufficient area), it digs; otherwise, there is no digging. Thus, excavation dynamics in maturing colonies emerge from this collective response to local density cues, without any individual needing to directly assess the global nest volume (V) or to have explicit knowledge of the colony size (N).
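The threshold rule described above can be sketched as a toy simulation. This is a deliberately simplified illustration and not equation (2) of the manuscript: it assumes that perceived density is simply the number of ants divided by the nest area, that every digging ant excavates at the same constant rate, and that each ant carries a fixed density threshold. All names and numerical values are hypothetical.

```python
def excavation_step(area, thresholds, dig_rate, dt):
    """Advance the nest area by one time step under the threshold rule."""
    density = len(thresholds) / area  # assumed proxy for perceived density
    # An ant digs only if the perceived density exceeds its own threshold.
    diggers = sum(1 for threshold in thresholds if density > threshold)
    return area + dt * dig_rate * diggers


def simulate(thresholds, area0=1.0, dig_rate=0.5, dt=0.1, steps=5000):
    """Iterate the rule from an initial area; digging stops on its own."""
    area = area0
    for _ in range(steps):
        area = excavation_step(area, thresholds, dig_rate, dt)
    return area


# Hypothetical thresholds: young ants tolerate a lower density, so they keep
# digging for longer than old ants in a group of the same size.
young_area = simulate([0.1] * 5)
old_area = simulate([0.5] * 5)
```

In this sketch the young group ends with a larger nest than the old group of the same size, and no ant ever needs the total nest volume or colony size as separate quantities, only the perceived density.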
As suggested by the reviewer, we have added these points to the discussion, contrasting the previously published models with our age-dependent excavation models (line numbers: 283-290) “In our study, we adopted the simpler version of previously published age-independent excavation models, where individuals respond to local stigmergic cues such as encounter rates or pheromone concentrations, which serve as a proxy for the global nest volume (24,30). We minimally modified this model to include age-dependent density targets. According to our age-dependent digging model, each ant compares this perceived local density to its own innate age-specific digging threshold as quantified from the fixed demographics experiments. If the perceived local density exceeds its age-dependent area threshold (indicating insufficient area), it digs; otherwise, there is no digging. This mechanism eliminates the need for cognitively demanding global assessment of the total nest volume or the overall colony population, a requirement for the saturation model (29)”.
I still find it a little concerning that the age-independent model, though it cannot be correct, fits the data better than the age-dependent modification. It seems to me the models presented in refs 24, 29, and 30, which served as inspiration for the one presented here, do not have any deep theoretical origin, but were chosen for "being consistent with" the observed overall excavated volumes. Is this correct, and if so, how much can/should be gleaned about behavior from these models? Please provide some discussion of what is reasonable to expect from such a model as well as what the limitations might be.
We thank the reviewer for the comment.
In our study, we make an important assumption, as described in lines 161-164 of the manuscript, that ants rely on local cues during nest excavation, and individuals cannot distinguish between the fixed demographics and colony maturation conditions. This implies that the age-dependent target area identified in the fixed demographics experiments should also account for the excavation dynamics seen in the colony maturation experiments.
From the fixed demographics young and old experiments, we directly quantified that the younger ants excavate a significantly larger area than the older ants for the same group size. This age-dependent digging propensity is an experimental result, and not a model output.
We agree that the age-independent model fits the colony maturation experiments well, even though it is not a statistically better fit than the age-dependent model. However, the age-independent models in the references (24,29,30) fail to explain the empirically obtained excavation dynamics in the fixed demographics young and old colonies. If these models were indeed true, then we would have observed similar excavated areas across the colony maturation, fixed demographics young, and fixed demographics old colonies of the same size. Thus, the inconsistency of these models with the data confirms that age-independent assumptions are biologically inadequate. These details are explicitly mentioned in lines 304-309.
We believe that our model’s value is in providing a plausible explanation for the observed excavation dynamics in the colony maturation experiments, and generating testable predictions (Figures 4C and 4D, described in lines 199-216) about the percentage contribution of different age cohorts and queens to the excavated area from the colony maturation experiments. This prediction would not be possible with an age-independent model.
Minor comments:
Figure 2A: Please use a color other than white for the model... this curve is still very hard to see
We thank the reviewer for the comment. The colour is changed to yellow.
Figure 4A: Should quoted confidence intervals for slope and intercept be swapped?
Yes, we thank the reviewer for pointing this out. The labels for the slope and intercept were swapped. We corrected this in the current revised version 2.
Figure 5 D-F: Can the authors show data points and confidence intervals instead of bar graphs? The error bars dipping below zero do not clearly represent the data.
We thank the reviewer for the comment. We now show the individual data points from each treatment with the 95% Confidence Interval of the mean.
eLife Assessment
The present manuscript by Cordeiro et al., shows convincing evidence that α-mangostin, a xanthone obtained from the fruit of the Garcinia mangostana tree, behaves as a strong activator of the large-conductance (BK) potassium channels; macroscopic currents and single-channel experiments show that α-mangostin produces an increase in the probability of opening, without affecting the single-channel conductance. The authors put forward that α-mangostin activation of the BK channel is state-independent, and molecular docking and mutagenesis suggest that α-mangostin binds to a site in the internal cavity. Additionally, the authors show that α-mangostin can relax arteries, further suggesting the plausibility of the proposed effects of this compound. These are valuable findings that should be of interest to channel biophysicists and physiologists alike.
Reviewer #1 (Public review):
In this manuscript, the authors aimed to identify the molecular target and mechanism by which α-Mangostin, a xanthone from Garcinia mangostana, produces vasorelaxation that could explain the antihypertensive effects. Building on prior reports of vascular relaxation and ion channel modulation, the authors convincingly show that large-conductance potassium BK channels are the primary site of action. Using electrophysiological, pharmacological, and computational evidence, the authors achieved their aims and showed that BK channels are the critical molecular determinant of mangostin's vasodilatory effects, even though the vascular studies are quite preliminary in nature.
Strengths:
(1) The broad pharmacological profiling of mangostin across potassium channel families, revealing BK channels - and the vascular BK-alpha/beta1 complex - as the potently activated target in a concentration-dependent manner.
(2) Detailed gating analyses showing large negative shifts in voltage-dependence of activation and altered activation and deactivation kinetics.
(3) High-quality single-channel recordings for open probability and dwell times.
(4) Convincing activation in reconstituted BKα/β1-Caᵥ nanodomains mimicking physiological conditions and functional proof-of-concept validation in mouse aortic rings.
Weaknesses are minor:
(1) Some mutagenesis data (e.g., partial loss at L312A) could benefit from complementary structural validation.
(2) While Cav-BK nanodomains were reconstituted, direct measurement of calcium signals after mangostin application onto native smooth muscle could be valuable.
(3) The work has an impact on ion channel physiology and pharmacology, providing a mechanistic link between a natural product and vasodilation. Datasets include electrophysiology traces, mutagenesis scans, docking analyses, and aortic tension recordings. The latter, however, are preliminary in nature.
Reviewer #2 (Public review):
Summary:
In the present manuscript, Cordeiro et al. show that α-mangostin, a xanthone obtained from the fruit of the Garcinia mangostana tree, behaves as an agonist of the BK channels. The authors arrive at this conclusion through the effect of mangostin on macroscopic and single-channel currents elicited by BK channels formed by the α subunit and α + β1 subunits, as well as αβ1 channels coexpressed with voltage-dependent Ca2+ (CaV1.2) channels. The single-channel experiments show that α-mangostin produces a robust increase in the probability of opening without affecting the single-channel conductance. The authors contend that α-mangostin activation of the BK channel is state-independent, and molecular docking and mutagenesis suggest that α-mangostin binds to a site in the internal cavity. Importantly, α-mangostin (10 μM) alleviates the contracture promoted by noradrenaline. Mangostin is ineffective if the contracted muscles are pretreated with the BK toxin iberiotoxin.
Strengths:
The set of results combining electrophysiological measurements, mutagenesis, and molecular docking reveals α-mangostin as a potent activator of BK channels and the putative location of the α-mangostin binding site. Moreover, experiments conducted on aortic preparations from mice suggest that α-mangostin can aid in developing drugs to treat a myriad of diverse diseases involving the BK channel.
Weaknesses:
Major:
(1) Although the results indicate that α-mangostin is modifying the closed-open equilibrium, the conclusion that this can be due to a stabilization of the voltage sensor in its active configuration may prove to be wrong. It is more probable that, as has been demonstrated for other activators, α-mangostin is increasing the equilibrium constant that defines the closed-open reaction (L in the Horrigan-Aldrich allosteric gating model for BK). The paper would gain much if the authors determined the probability of opening over a wide range of voltages, to determine how the drug affects (or does not affect) the channel voltage dependence, the coupling between the voltage sensor and the pore, and the closed-open equilibrium (L).
(2) Apparently, the molecular docking was performed using the truncated structure of the human BK channel. However, it is unclear which one, since the PDB ID given in the Methods (6vg3), according to what I could find, corresponds to the unliganded, inactive PTK7 kinase domain. Be that as it may, the apo and Ca2+-bound structures show that there is a rotation and a displacement of the S6 transmembrane domain. Therefore, the positions of the residues I308, L312, and A316 in the closed and open configurations of the BK channel are not the same. Hence, it is expected that the strength of binding will differ depending on whether the channel is closed or open. This point needs to be discussed.
Minor:
(1) From Figure 3A, it is apparent that the increase in Po is at the expense of the long periods (seconds) that the channel remains closed. One might suggest that α-mangostin increases the burst periods. It would be beneficial if the authors measured both closed and open dwell times to test whether α-mangostin primarily affects the burst periods.
(2) In several places, the authors draw parallels between the modes of action of other BK activators and α-mangostin; however, the work of Gessner et al. PNAS 2012 indicates that NS1619 and Cym04 interact with the S6/RCK linker, and Webb et al. demonstrated that GoSlo-SR-5-6 agonist activity is abolished when residues in the S4/S5 linker and in the S6C region are mutated. These findings indicate that these agonists do not bind near the selectivity filter, where the authors' results suggest that α-mangostin binds.
(3) The sentence starting in line 452 states that there is a pronounced allosteric coupling between the voltage sensors and Ca2+ binding. If the authors are referring to the coupling factor E in the Horrigan-Aldrich gating model, the references cited, in particular, Sun and Horrigan, concluded that the coupling between those sensors is weak.
Reviewer #3 (Public review):
Summary:
This research shows that α-mangostin, a proposed nutraceutical with cardiovascular protective properties, could act through the activation of large-conductance potassium-permeable channels (BK). The authors provide convincing electrophysiological evidence that the compound binds to BK channels and induces a potent activation, increasing the magnitude of potassium currents. Since these channels are important modulators of the membrane potential of smooth muscle in vascular tissue, this activation leads to muscle relaxation, possibly explaining cardiovascular protective effects.
Strengths:
The authors present evidence based on several lines of experiments that α-mangostin is a potent activator of BK channels. The quality of the experiments and the analysis is high and represents an appropriate level of analysis. This research is timely and provides a basis to understand the physiological effects of natural compounds with proposed cardio-protective effects.
Weaknesses:
The identification of the binding site is not the strongest point of the manuscript. The authors show that the binding site is probably located in the hydrophobic cavity of the pore and show that point mutations reduce the magnitude of the negative voltage shift of activation produced by α-mangostin. However, these experiments do not demonstrate binding to these sites, and could be explained by allosteric effects on gating induced by the mutations themselves.
Author response:
We sincerely thank the reviewers and editors for their thoughtful evaluations of our work. We are grateful for the careful reading, constructive critiques, and encouraging comments regarding the electrophysiological analyses, mutagenesis, and vascular experiments. The suggestions provided have been very helpful, and we are working to address these points in our revision to strengthen the manuscript and improve its clarity.
In revising the manuscript, we plan to clarify several text passages as recommended by the reviewers, and review and refine the discussion for improved precision. Following the suggestions of the reviewers, we plan to perform a number of additional experiments to provide more data for the binding region and for further mechanistic and physiological insight. We will prepare a point-by-point response addressing all issues raised in a detailed rebuttal. Additionally, we will include improvements in the Methods section as suggested by the SciScore core report.
We appreciate the opportunity to revise our work and thank the reviewers again for their valuable feedback.
eLife Assessment
The one-carbon tetrahydrofolate metabolism plays a crucial role in producing essential metabolic intermediates. In this study, the authors employ a genetics-based approach to demonstrate that three different metabolic pathways are essential for synthesizing 1C-tetrahydrofolates (1C-THF). Disrupting any of these pathways impairs both growth and virulence. Although the work presented is valuable, the experimental evidence remains incomplete without direct quantification of folate intermediates.
Reviewer #1 (Public review):
Summary:
This study identifies three redundant pathways that feed the one-carbon tetrahydrofolate (1C-THF) pool essential for Listeria monocytogenes growth and virulence: the glycine cleavage system (GCS), serine hydroxymethyltransferase (GlyA), and formate-tetrahydrofolate ligase/FolD. Reactivation of the normally inactive fhs gene rescues 1C-THF deficiency, revealing metabolic plasticity and vulnerability for potential antimicrobial targeting.
Strengths:
(1) Novel evolutionary insight - reversible reactivation of a pseudogene (fhs) shows adaptive metabolic plasticity, relevant for pathogen evolution.
(2) They systematically combine targeted gene deletions with suppressor screening to dissect the folate/one-carbon network (GCS, GlyA, Fhs/FolD).
Weaknesses:
(1) The study infers 1C-THF depletion mostly genetically and indirectly (growth rescue with adenine) without direct quantification of folate intermediates or fluxes. Biochemical confirmation, LC-MS-based metabolomics of folates/1C donors, or isotopic tracing would strengthen mechanistic claims.
(2) In multiple result sections, the authors report data from technical triplicates but do not mention independent biological replicates (e.g., Figure 2C, Figure 4A-B, Figure 6D). In addition, some results mention statistical significance but without a detailed description of the specific statistical tests used or replicates, such as Figure 2A-C, Figure 2E, and Figure 2G-I.
Reviewer #2 (Public review):
Summary:
The manuscript by Freier et al. examines the impact of deletion of the glycine cleavage system (GCS) GcvPAB enzyme complex in the facultative intracellular bacterial pathogen Listeria monocytogenes. GcvPAB mediates the oxidative decarboxylation of glycine as a first step in a pathway that leads to the generation of N5,N10-methylene-tetrahydrofolate (THF) to replenish the 1-carbon THF (1C-THF) pool. 1C-THF species are important for the biosynthesis of purines and pyrimidines as well as for the formation of serine, methionine, and N-formylmethionine, and the authors have previously demonstrated that gcvPAB is important for bacterial replication within macrophages. A significant defect for growth is observed for the gcvPAB deletion mutant in defined media, and this growth defect appears to stem from the sensitivity of the mutant strain to excess glycine, which is hypothesized to further deplete the 1C-THF pool. Selection of suppressor mutations that restored growth of gcvPAB deletion mutants in synthetic media with high glycine yielded mutants that reversed stop codon inactivation of the formate-tetrahydrofolate ligase (fhs) gene, supporting the premise that generation of N10-formyl-THF can restore growth. Mutations within the folK, codY, and glyA genes, encoding serine hydroxymethyltransferase, were also identified, although the functional impact of these mutations is somewhat less clear. Overall, the authors report that their work identifies three pathways that feed the 1C-THF pool to support the growth and virulence of L. monocytogenes and that this work represents the first example of the spontaneous reactivation of a L. monocytogenes gene that is inactivated by a premature stop codon.
Strengths:
This is an interesting study that takes advantage of a naturally existing fhs mutant Listeria strain to reveal the contributions of different pathways leading to 1C-THF synthesis. The defects observed for the gcvPAB mutant in terms of intracellular growth and virulence are somewhat subtle, indicating that bacteria must be able to access host sources (such as adenine?) to compensate for the loss of purine and fMet synthesis. Overall, the authors do a nice job of assessing the importance of the pathways identified for 1C-THF synthesis.
Weaknesses:
(1) Line 114 and Figure 1: The authors indicate that the gcvPAB deletion forms significantly fewer plaques in addition to forming smaller plaques (although this is a bit hard to see in the plaque images). A reduction in the overall number of plaques sounds like a bacterial invasion defect - has this been carefully assessed? The smaller plaque size makes sense with reduced bacterial replication, but I'm not sure I understand the reduction in plaque number.
(2) Do other Listeria strains contain the stop codon in fhs? How common is this mutation? That would be interesting to know.
(3) Based on the observation that fhs+ ΔgcvPAB ΔglyA mutant is only possible to isolate in complex media, and fhs is responsible for converting formate to 1C-THF with the addition of FolD, have the authors thought of supplementing synthetic media with formate and assessing mutant growth?
Reviewer #3 (Public review):
Summary:
In this study, Freier et al. demonstrate that 3 distinct metabolic pathways are critical for the synthesis of 1C-THF, a metabolite that is crucial for the growth and virulence of Listeria monocytogenes. Using an elegant suppressor screen, they also demonstrate the hierarchical importance of these metabolic pathways with respect to the biosynthesis of 1C-THF.
Strengths:
This study uses elegant bacterial genetics to confirm that 3 distinct metabolic pathways are critical for 1C-THF synthesis in L. monocytogenes, and the lack of either one of these pathways compromises bacterial growth and virulence. The study uses a combination of in vitro growth assays, macrophage-CFU assays, and murine infection models to demonstrate this.
Weaknesses:
(1) The primary finding of the study is that the perturbation of any of the 3 metabolic pathways important for the synthesis of 1C-THF results in reduced growth and virulence of L. monocytogenes. However, there is no evidence demonstrating the levels of 1C-THF in the various knockouts and suppressor mutants used in this study. It is important to measure the levels of this metabolite (ideally using mass spectrometry) in the various knockouts and suppressor mutants, to provide strong causality.
(2) The story becomes a little hard to follow since macrophage-CFU assays and murine infection model data precede the in vitro growth assays. The manuscript would benefit from a reorganization of Figures 2,3, and 4 for better readability and flow of data.
eLife Assessment
The study highlights development of a multiplex coregulator TR-FRET (CRT) assay that detects ligands with theoretical full agonist, partial agonist, antagonist, and inverse agonist signatures within the same chemical series. The findings are valuable and will have theoretical and practical implications in the subfield, with respect to guiding the design of non-lipogenic liver X receptor (LXR) agonists. The strength of the evidence is solid, whereby the methods, data, and analyses broadly support the claims with only minor weaknesses that can be dealt with through improvements in the data analysis and the discussion. This study will be of interest to experts working in the areas of pharmacology, medicinal chemistry, and drug discovery in Alzheimer's diseases and dementias.
Reviewer #1 (Public review):
Summary:
This important study functionally profiled ligands targeting the LXR nuclear receptors using biochemical assays in order to classify ligands according to pharmacological functions. Overall, the evidence is solid, but nuances in the reconstituted biochemical assays and cellular studies and terminology of ligand pharmacology limit the potential impact of the study. This work will be of interest to scientists interested in nuclear receptor pharmacology.
Strengths:
(1) The authors rigorously tested their ligand set in CRTs for several nuclear receptors that could display ligand-dependent cross-talk with LXR cellular signaling and found that all compounds display LXR selectivity when used at ~1 µM.
(2) The authors tested the ligand set for selectivity against two LXR isoforms (alpha and beta). Most compounds were found to be LXRbeta-specific.
(3) The authors performed extensive LXR CRTs, performed correlation analysis to cellular transcription and gene expression, and classification profiling using heatmap analysis-seeking to use relatively easy-to-collect biochemical assays with purified ligand-binding domain (LBD) protein to explain the complex activity of full-length LXR-mediated transcription.
Weaknesses:
(1) The descriptions of some observations lack detail, which limits understanding of some key concepts.
(2) The presence of endogenous NR ligands within cells may confound the correlation of ligand activity of cellular assays to biochemical assay data.
(3) The normalization of biochemical assay data could confound the classification of graded activity ligands.
(4) The presence of >1 coregulator peptide in the biplex (n=2 peptides) CRT (pCRT) format will bias the LBD conformation towards the peptide-bound form with the highest binding affinity, which will impact potency and interpretation of TR-FRET data.
(5) Correlation graphical plots lack sufficient statistical testing.
(6) Some of the proposed ligand pharmacology nomenclature is not clear and deviates from classifications used currently in the field (e.g., hard and soft antagonist; weak vs. partial agonist, definition of an inverse agonist that is not the opposite function to an agonist).
Reviewer #2 (Public review):
Summary:
In this manuscript by Laham and co-workers, the authors profiled structurally diverse LXR ligands via a coregulator TR-FRET (CRT) assay for their ability to recruit coactivators and kick off corepressors, while identifying coregulator preference and LXR isoform selectivity.
The relative ligand potencies measured via CRT for the two LXR isoforms were correlated with ABCA1 induction or lipogenic activation of SRE, depending on cellular contexts (i.e., astrocytoma or hepatocarcinoma cells). While these correlations are interesting, there is some leeway to improve the quantitative presentation of these correlations. Finally, the CRT signatures were correlated with the structural stabilization of the LXR: coregulator complexes. In aggregate, this study curated a set of LXR ligands with disparate agonism signatures that may guide the design of future nonlipogenic LXR agonists with potential therapeutic applications for cardiovascular disease, Alzheimer's, and type 2 diabetes, without inducing mechanisms that promote fat/lipid production.
Strengths:
This study has many strengths, from curating an excellent LXR compound set to the thoughtful design of the CRT and cellular assays. The design of a multiplexed precision CRT (pCRT) assay that detects corepressor displacement as a function of ligand-induced coactivator recruitment is quite impressive, as it allows measurement of ligand potencies to displace corepressors in the presence of coactivators, which cannot be achieved in a regular CRT assay that looks at coactivator recruitment and corepressor dissociation in separate experiments.
Weaknesses:
I did not identify any major weaknesses.
eLife Assessment
This manuscript describes a valuable screening approach to identifying nanobodies with the potential to modulate gene expression via epigenetic regulators. While the concept is of interest and the screening strategy is well designed, the current evidence supporting mechanistic specificity remains incomplete.
Reviewer #1 (Public review):
Summary:
This study presents a high-throughput screening platform to identify nanobodies capable of recruiting chromatin regulators and modulating gene expression. The authors utilize a yeast display system paired with mammalian reporter assays to validate candidate nanobodies, aiming to create a modular resource for synthetic epigenetic control.
Strengths:
(1) The overall screening design combining yeast display with mammalian functional assays is innovative and scalable.
(2) The authors demonstrate proof-of-concept that nanobody-based recruitment can repress or activate reporter expression.
(3) The manuscript contributes to the growing toolkit for epigenome engineering.
Weaknesses:
(1) The manuscript does not investigate which endogenous factors are recruited by the nanobodies. While repression activity is demonstrated at the reporter level, there is no mechanistic insight into what proteins are being brought to the target site by each nanobody. This limits the interpretability and generalizability of the findings. Related to this, Figure S1B reports sequence similarity among complementarity-determining regions (CDRs) of nanobodies that scored highly in the DNMT3A screen. However, it remains unclear whether this similarity reflects convergence on a common molecular target or is coincidental. Without functional or proteomic validation, the relationship between sequence motifs and effector recruitment remains speculative.
(2) The epigenetic consequences of nanobody recruitment are also left unexplored. Despite targeting epigenetic regulators, the study does not assess changes such as DNA methylation or histone modifications. This makes it difficult to interpret whether the observed reporter repression is due to true chromatin remodeling or secondary effects.
Reviewer #2 (Public review):
Summary:
Wan, Thurm et al. use a yeast nanobody library that is thought to have diverse binders to isolate those that specifically bind to proteins of their interest. The yeast nanobody library collection in general carries enormous potential, but the challenge is to isolate binders that have specific activity. The authors posit that one reason for this isolation challenge is that the negative binders, in general, dampen the signal from the positive binders. This is a classic screening problem (one that geneticists have faced over decades) and, in general, underscores the value of developing a good secondary screen. Over many years, the authors have developed an elegant platform to carry out high-throughput silencing-based assays, thus creating the perfect secondary screen platform to isolate nanobodies that bind to chromatin regulators.
Strengths:
Highlights the enormous value of a strong secondary screen when identifying binders that can be isolated from the yeast nanobody library. This insight is generalizable, and I expect that this manuscript should help inspire many others to design such approaches.
Provides new cell-based reagents that can be used to recruit epigenetic activators or repressors to modulate gene expression at target loci.
Weaknesses:
The authors isolate DNMT3A and TET1/2 enzymes directly from cell lysates and bind these proteins to beads. It is not clear what proteins are, in fact, bound to beads at the end of the IP. Epigenetic repressors are part of complexes, and it would be helpful to know if the IP is specific and whether the IP pulls down only DNMT3A or other factors. While this does not change the underlying assumptions about the screen, it does alter the authors' conclusions about whether the nanobody exclusively recruits DNMT3A or potentially binds to other co-factors.
Using IP-MS to validate the pull-down would be a helpful addition to the manuscript, although one could very reasonably make the case that other co-factors get washed away during the course of the selection assay. Nevertheless, if there are co-factors that are structural and remain bound, these are likely to show up in the MS experiment.
eLife Assessment
This important study reports on the relationships between cerebral haemodynamics and a number of factors that relate to genetics, lifestyle, and medical history using data from a large cohort. Compelling evidence suggests that brief arterial spin labelling MRI acquisition can lead to both expected observations about brain health, as manifested in cerebral blood flow, and biomarkers for use in diagnosis and treatment monitoring. The results can be used as a starting point for hypothesis generation and further evaluation of conditions expected to affect haemodynamics in the brain.
Reviewer #1 (Public review):
Summary:
In this work, Okell et al. describe the imaging protocol and analysis pipeline pertaining to the arterial spin labeling (ASL) MRI protocol acquired as part of the UK Biobank imaging study. In addition, they present preliminary analyses of the first 7000+ subjects in whom ASL data were acquired, and this represents the largest such study to date. Careful analyses revealed expected associations between ASL-based measures of cerebral hemodynamics and non-imaging-based markers, including heart and brain health, cognitive function, and lifestyle factors. As it measures physiology and not structure, ASL-based measures may be more sensitive to these factors compared with other imaging-based approaches.
Strengths:
This study represents the largest MRI study to date to include ASL data in a wide age range of adult participants. The ability to derive arterial transit time (ATT) information in addition to cerebral blood flow (CBF) is a considerable strength, as many studies focus only on the latter.
Some of the results (e.g., relationships with cardiac output and hypertension) are known and expected, while others (e.g., lower CBF and longer ATT correlating with hearing difficulty in auditory processing regions) are more novel and intriguing. Overall, the authors present very interesting physiological results, and the analyses are conducted and presented in a methodical manner.
The analyses regarding ATT distributions and the potential implications for selecting post-labeling delays (PLD) for single PLD ASL are highly relevant and well-presented.
Weaknesses:
At a total scan duration of 2 minutes, the ASL sequence utilized in this cohort is much shorter than that of a typical ASL sequence (closer to 5 minutes as mentioned by the authors). However, this implementation also included multiple (n=5) PLDs. As currently described, it is unclear how many repetitions were acquired at each PLD and whether these were acquired efficiently (i.e., with a Look-Locker readout) or whether individual repetitions within this acquisition were dedicated to a single PLD. If the latter, the number of repetitions per PLD (and consequently signal-to-noise ratio, SNR) is likely to be very low. Have the authors performed any analyses to determine whether the signal in individual subjects generally lies above the noise threshold? This is particularly relevant for white matter, which is the focus of several findings discussed in the study.
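The SNR concern above can be made concrete with a rough sketch. It assumes independent noise across repetitions, so that SNR grows with the square root of the number of averages, and it uses a hypothetical total repetition count (`total_repetitions`); the actual number is not stated in this review. Only the five PLDs come from the text above.

```python
import math

def relative_snr(n_averages, reference_averages):
    """SNR relative to a reference acquisition, assuming independent noise
    so that SNR scales with the square root of the number of averages."""
    return math.sqrt(n_averages / reference_averages)

total_repetitions = 20  # hypothetical value, not taken from the manuscript
n_plds = 5              # number of post-labeling delays mentioned in the review

# If each repetition is dedicated to a single PLD, the averages are split evenly.
per_pld = total_repetitions // n_plds
rel = relative_snr(per_pld, total_repetitions)
print(f"averages per PLD: {per_pld}, per-PLD SNR relative to single-PLD: {rel:.2f}")
```

Under these assumptions, each PLD image would have less than half the SNR of a single-PLD acquisition of the same duration. Model fitting across PLDs pools information, so this is a per-image figure rather than a statement about the final CBF or ATT estimates.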
Hematocrit is one of the variables regressed out in order to reduce the effect of potential confounding factors on the image-derived phenotypes. The effect of this, however, may be more complex than accounting for other factors (such as age and sex). The authors acknowledge that hematocrit influences ASL signal through its effect on longitudinal blood relaxation rates. However, it is unclear how the authors handled the fact that the longitudinal relaxation of blood (T1Blood) is explicitly needed in the kinetic model for deriving CBF from the ASL data. In addition, while it may reduce false positives related to the relationships between dietary factors and hematocrit, it could also mask the effects of anemia present in the cohort. The concern, therefore, is two-fold: (1) Were individual hematocrit values used to compute T1Blood values? (2) What effect would the deconfounding process have on this?
The authors leverage an observed inverse association between white matter hyperintensity volume and CBF as evidence that white matter perfusion can be sensitively measured using the imaging protocol utilized in this cohort. The relationship between white matter hyperintensities and perfusion, however, is not yet fully understood, and there is disagreement regarding whether this structural imaging marker necessarily represents impaired perfusion. Therefore, it may not be appropriate to use this finding as support for validation of the methodology.
Reviewer #2 (Public review):
Summary:
Okell et al. report the incorporation of arterial spin-labeled (ASL) perfusion MRI into the UK Biobank study and preliminary observations of perfusion MRI correlates from over 7000 acquired datasets, which is the largest sample of human perfusion imaging data to date. Although a large literature already supports the value of ASL MRI as a biomarker of brain function, this important study provides compelling evidence that a brief ASL MRI acquisition may lead to both fundamental observations about brain health as manifested in CBF and valuable biomarkers for use in diagnosis and treatment monitoring.
ASL MRI noninvasively quantifies regional cerebral blood flow (CBF), which reflects both cerebrovascular integrity and neural activity, hence serves as a measure of brain function and a potential biomarker for a variety of CNS disorders. Despite a highly abbreviated ASL MRI protocol, significant correlations with both expected and novel demographic, physiological, and medical factors are demonstrated. In many such cases, ASL was also more sensitive than other MRI-derived metrics. The ASL MRI protocol implemented also enables quantification of arterial transit time (ATT), which provides stronger clinical correlations than CBF in some factors. The results demonstrate both the feasibility and the efficacy of ASL MRI in the UK Biobank imaging study, which expects to complete ASL MRI in up to 60,000 richly phenotyped individuals.
Strengths:
A key strength of this study is the use of an ASL MRI protocol incorporating balanced pseudocontinuous labeling with a background-suppressed 3D readout, which is the current state-of-the-art. To compensate for the short scan time, voxel resolution was intentionally only moderate. The authors also elected to acquire these data across five post-labeling delays, enabling ATT and ATT-corrected CBF to be derived using the BASIL toolbox, which is based on a variational Bayesian framework. The resulting CBF and ATT maps shown in Figure 1 are quite good, especially when combined with such a large and deeply phenotyped sample.
Another strength of the study is the rigorous image analysis approach, which included covariation for a number of known CBF confounds as well as correction for motion and scanner effects. In doing so, the authors were able to confirm expected effects of age, sex, hematocrit, and time of day on CBF values. These observations lend confidence in the veracity of novel observations, for example, significant correlations between regional ASL parameters and cardiovascular function, height, alcohol consumption, depression, and hearing, as well as with other MRI features such as regional diffusion properties and magnetic susceptibility. They also provide valuable observations about ATT and CBF distributions across a large cohort of middle-aged and older adults.
Weaknesses:
This study primarily serves to illustrate the efficacy and potential of ASL MRI as an imaging parameter in the UK Biobank study, but some of the preliminary observations will be hypothesis-generating for future analyses in larger sample sizes. However, a weakness of the manuscript is that some of the reported observations are difficult to follow. In particular, the associations between ASL and resting fMRI illustrated in Figure 7 and described in the accompanying Results text are difficult to understand. It could also be clearer whether the spatial maps showing ASL correlates of other image-derived phenotypes in Figure 6B are global correlations or confined to specific regions of interest. Finally, while addressing partial volume effects in gray matter regions by covarying for cortical thickness is a reasonable approach, the Methods section seems to imply that a global mean cortical thickness is used, which could be problematic given that cortical thickness changes may be localized.
Reviewer #3 (Public review):
Summary:
This is an extremely important manuscript in the evolution of cerebral perfusion imaging using Arterial Spin Labelling (ASL). The number of subjects that were scanned has provided the authors with a unique opportunity to explore many potential associations between regional cerebral blood flow (CBF) and clinical and demographic variables.
Strengths:
The major strength of the manuscript is the access to an unprecedentedly large cohort of subjects. It demonstrates the sensitivity of regional tissue blood flow in the brain as an important marker of resting brain function. In addition, the authors have demonstrated a thorough analysis methodology and good statistical rigour.
Weaknesses:
This reviewer did not identify any major weaknesses in this work.
eLife Assessment
This important study presents convincing evidence that uncovers a novel signaling axis impacting the post-mating response in females of the brown planthopper. The findings open several avenues for testing the molecular and neurobiological mechanisms of mating behavior in insects, although broad concerns remain about the relevance of some claims.
Reviewer #1 (Public review):
In this work, Zhang et al., through a series of well-designed experiments, present a comprehensive study exploring the roles of the neuropeptide Corazonin (CRZ) and its receptor in controlling the female post-mating response (PMR) in the brown planthopper (BPH) Nilaparvata lugens and Drosophila melanogaster. Through a series of behavioural assays, micro-injections, gene knockdowns, CRISPR/Cas gene editing, and immunostaining, the authors show that both CRZ and CrzR play a vital role in the female post-mating response, with impaired expression of either leading to quicker female remating and reduced ovulation in BPH. Notably, the authors find that this signaling is entirely endogenous in BPH females, with immunostaining of male accessory glands (MAGs) showing no evidence of CRZ expression. Further, the authors demonstrate that while CRZ is not expressed in the MAGs, BPH males with Crz knocked out show transcriptional dysregulation of several seminal fluid proteins and functionally link this dysregulation to an impaired PMR in BPH. In relation, the authors also find that in CrzR mutants, the injection of neither MAG extracts nor maccessin peptide triggered the PMR in BPH females. Finally, the authors extend this study to D. melanogaster, albeit on a more limited scale, and show that CRZ plays a vital role in maintaining PMR in D. melanogaster females with impaired CRZ signaling, once again leading to quicker female remating and reduced ovulation. The authors must be commended for their expansive set of complementary experiments. The manuscript is also generally well written. Given the seemingly conserved nature of CRZ, this work is a significant addition to the literature, opening several avenues for testing the molecular and neurobiological mechanisms in which CRZ triggers the PMR.
However, there are some broad concerns/comments I had with this manuscript. The authors provide clear evidence that CRZ signaling plays a major role in the PMR of D. melanogaster; however, they provide no evidence that CRZ signaling is endogenous, as they did not check for expression in the MAGs of D. melanogaster males. Additionally, while the authors show that manipulating Crz in males leads to dysregulated seminal fluid expression and impaired PMR in BPH, the authors also find that CRZ injection in males in and of itself impairs PMR in BPH. The authors do not really address what this seemingly contradictory result could mean. While a lot of the figures have replicate numbers, the authors do not factor in replicate as an effect in their models, which they ideally should do.
Finally, while the discussion is generally well-written, it lacks a broader conclusion about the wider implications of this study and what future work building on this could look like.
Reviewer #2 (Public review):
Summary:
The work presented by Zhang and coauthors in this manuscript presents the study of the neuropeptide corazonin in modulating the post-mating response of the brown planthopper, with further validation in Drosophila melanogaster. To obtain their results, the authors used several different techniques that orthogonally demonstrate the involvement of corazonin signalling in regulating the female post-mating response in these species.
They first injected synthetic corazonin peptide into female brown planthoppers, showing altered mating receptivity in virgin females and a higher number of eggs laid after mating. The role of corazonin in controlling these post-mating traits has been further validated by knocking down the expression of the corazonin gene by RNA interference and through CRISPR-Cas9 mutagenesis of the gene. Further proof of the importance of corazonin signalling in regulating the female post-mating response has been achieved by knocking down the expression or mutagenizing the gene coding for the corazonin receptor.
Similar results have been obtained in the fruit fly Drosophila melanogaster, suggesting that corazonin signalling is involved in controlling the female post-mating response in multiple insect species. Notably, the authors also show that corazonin controls gene expression in the male accessory glands and that disruption of this pathway in males compromises their ability to elicit normal post-mating responses in their mates.
Strengths:
The study of the signalling pathways controlling the female post-mating response in insects other than Drosophila is scarce, and this limits the ability of biologists to draw conclusions about the evolution of the post-mating response in female insects. This is particularly relevant in the context of understanding how sexual conflict might work at the molecular and genetic levels, and how, ultimately, speciation might occur at this level. Furthermore, the study of the post-mating response could have practical implications, as it can lead to the development of control techniques, such as sterilization agents.
The study, therefore, expands the knowledge of one of the signalling pathways that control the female post-mating response, the corazonin neuropeptide. This pathway is involved in controlling the post-mating response in both Nilaparvata lugens (the brown planthopper) and Drosophila melanogaster, suggesting its involvement in multiple insect species.
The study uses multiple molecular approaches to convincingly demonstrate that corazonin controls the female post-mating response.
Weaknesses:
The data supporting the main claims of the manuscript are solid and convincing. The statistical analysis of some of the data might be improved, particularly by tailoring the analysis to the type of data that has been collected.
In the case of the corazonin effect in females, all the data are coherent; in the case of CRISPR-Cas9-induced mutagenesis, the analysis of the behavioural trait in heterozygotes might have helped in understanding the haplosufficiency of the gene and would have further proved the authors' point.
Less consistency was achieved in males (Figure 5): the authors show that injection of CRZ and RNAi of crz, or mutant crz, has the same effect on male fitness. However, the CRZ injection should activate the pathway, and crz RNAi and mutant crz should inhibit the pathway, yet they have the same effect. A comment about this discrepancy would have improved the clarity of the manuscript, pointing to new points that need to be clarified and opening new scientific discussion.
eLife Assessment
This valuable study addresses a critical and timely question regarding the role of a subpopulation of cortical interneurons (Chrna2-expressing Martinotti cells) in motor learning and cortical dynamics. However, while some of the behavior and imaging data are impressive, the small sample sizes and incomplete behavioral and activity analyses make interpretation difficult; therefore, they are insufficient to support the central conclusions. The study may be of interest to neuroscientists studying cortical neural circuits, motor learning, and motor control.
Reviewer #1 (Public review):
In this study, the authors investigated a specific subtype of SST-INs (layer 5 Chrna2-expressing Martinotti cells) and examined its functional role in motor learning. Using endoscopic calcium imaging combined with chemogenetics, they showed that activation of Chrna2 cells reduces the plasticity of pyramidal neuron (PyrN) assemblies but does not affect the animals' performance. However, activating Chrna2 cells during re-training improved performance. The authors claim that activating Chrna2 cells likely reduces PyrN assembly plasticity during learning and possibly facilitates the expression of already acquired motor skills.
There are many major issues with the study. The findings across experiments are inconsistent, and it is unclear how the authors performed their analyses or why specific time points and comparisons were chosen. The study requires major re-analysis and additional experiments to substantiate its conclusions.
Major Points:
(1a) Behavior task - the pellet-reaching task is a well-established paradigm in the motor learning field. Why did the authors choose to quantify performance using "success pellets per minute" instead of the more conventional "success rate" (see PMID 19946267, 31901303, 34437845, 24805237)? It is also confusing that the authors describe sessions 1-5 as being performed on a spoon, while from session 6 onward, the pellets are presented on a plate. However, in lines 710-713, the authors define session 1 as "naïve," session 2 as "learning," session 5 as "training," and "retraining" as a condition in which a more challenging pellet presentation was introduced. Does "naïve session 1" refer to the first spoon session or to session 6 (when the food is presented on a plate)? The same ambiguity applies to "learning session 2," "training session 5," and so on. Furthermore, what criteria did the authors use to designate specific sessions as "learning" versus "training"? Are these definitions based on behavioral performance thresholds or some biological mechanisms? Clarifying these distinctions is essential for interpreting the behavioral results.
(1b) Judging from Figures 1F and 4B, even in WT mice, it is not convincing that the animals have actually learned the task. In all figures, the mice generally achieve ~10-20 pellets per minute across sessions. The only sessions showing slightly higher performance are session 5 in Figure 1F ("train") and sessions 12 and 13 in Figure 4B ("CLZ"). In the classical pellet-reaching task, animals are typically trained for 10-12 sessions (approximately 60 trials per session, one session per day), and a clear performance improvement is observed over time. The authors should therefore present performance data for each individual session to determine whether there is any consistent improvement across days. As currently shown, performance appears largely unchanged across sessions, raising doubts about whether motor learning actually occurred.
(1c) The authors also appear to neglect existing literature on the role of SST-INs in motor learning and local circuit plasticity (e.g., PMID 26098758, 36099920). Although the current study focuses on a specific subpopulation of SST-INs, the results reported here are entirely opposite to those of previous studies. The authors should, at a minimum, acknowledge these discrepancies and discuss potential reasons for the differing outcomes in the Discussion section.
(2a) Calcium imaging - The methodology for quantifying fluorescence changes is confusing and insufficiently described. The use of absolute ΔF values ("detrended by baseline subtraction," lines 565-567) for analyses that compare activity across cells and animals (e.g., Figure 1H) is highly unconventional and problematic. Calcium imaging is typically reported as ΔF/F₀ or z-scores to account for large variations in baseline fluorescence (F₀) due to differences in GCaMP expression, cell size, and imaging quality. Absolute ΔF values are uninterpretable without reference to baseline intensity - for example, a ΔF of 5 corresponds to a 100% change in a dim cell (F₀ = 5) but only a 1% change in a bright cell (F₀ = 500). This issue could confound all subsequent population-level analyses (e.g., mean or median activity) and across-group comparisons. Moreover, while some figures indicate that normalization was performed, the Methods section lacks any detailed description of how this normalization was implemented. The critical parameters used to define the baseline are also omitted. The authors should reprocess the imaging data using a standardized ΔF/F₀ or z-score approach, explicitly define the baseline calculation procedure, and revise all related figures and statistical analyses accordingly.
(2b) Figure 1G - It is unclear why neural activity during successful trials is already lower one second before movement onset. Full traces with longer duration before and after movement onset should also be shown. Additionally, only data from "session 2 (learning)" and a single neuron are presented. The authors should present data across all sessions and multiple neurons to determine whether this observation is consistent and whether it depends on the stage of learning.
(2c) Figure 1H - The authors report that chemogenetic activation of Chrna2 cells induces differential changes in PyrN activity between successful and failed trials. However, one would expect that activating all Chrna2 cells would strongly suppress PyrN activity rather than amplifying the activity differences between trials. The authors should clarify the mechanism by which Chrna2 cell activation could exaggerate the divergence in PyrN responses between successful and failed trials. Perhaps, performing calcium imaging of Chrna2 cells themselves during successful versus failed trials would provide insight into their endogenous activity patterns and help interpret how their activation influences PyrN activity during successful and failed trials.
(2d) Figure 1H - Also, in general, the Cre⁺ (red) data points appear consistently higher in activity than the Cre⁻ (black) points. This is counterintuitive, as activating Chrna2 cells should enhance inhibition and thereby reduce PyrN activity. The authors should clarify how Cre⁺ animals exhibit higher overall PyrN activity under a manipulation expected to suppress it. This discrepancy raises concerns about the interpretation of the chemogenetic activation effects and the underlying circuit logic.
(3) The statistical comparisons throughout the manuscript are confusing. In many cases, the authors appear to perform multiple comparisons only among the N, L, T, and R conditions within the WT group. However, the central goal of this study should be to assess differences between the WT and hM3D groups. In fact, it is unclear why the authors only provide p-values for some comparisons but not for the majority of the groups.
(4a) Figure 4 - It is hard to understand why the authors introduce LFP experiments here, and the results are difficult to interpret in isolation. The authors should consider combining LFP recordings with calcium imaging (as in Figure 1) or, alternatively, repeating calcium imaging throughout the entire re-training period. This would provide a clearer link between circuit activity and behavior and strengthen the conclusions regarding Chrna2 cell function during re-training.
(4b) It is unclear why CLZ has no apparent effect in session 11, yet induces a large performance increase in sessions 12 and 13. Even then, the performance in sessions 12 and 13 (~30 successful pellets) is roughly comparable to Session 5 in Figure 1F. Given this, it is questionable whether the authors can conclude that Chrna2 cell activation truly facilitates previously acquired motor skills.
(5) Figure 5 - The authors report decreased performance in the pasta-handling task (presumably representing a newly learned skill) but observe no difference in the pellet-reaching task (presumably an already acquired skill). This appears to contradict the authors' main claim that Chrna2 cell activation facilitates previously acquired motor skills.
(6) Supplementary Figure 1 - The c-fos staining appears unusually clean. Previous studies have shown that even in home-cage mice, there are substantial numbers of c-fos⁺ cells in M1 under basal conditions (PMID 31901303, 31901303). Additionally, the authors should present Chrna2 cell labeling and c-fos staining in separate channels. As currently shown, it is difficult to determine whether the c-fos⁺ cells are truly Chrna2⁺ cells.
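The normalization concern in point (2a) can be illustrated with a minimal sketch. The two traces below are hypothetical and simply reuse the numbers from that point (F₀ = 5 and F₀ = 500, each with an absolute ΔF of 5); the function name and the choice of the first frames as the baseline window are assumptions for illustration, not the authors' pipeline.

```python
from statistics import mean

def delta_f_over_f0(trace, baseline_frames):
    """Return (F - F0) / F0 for each sample, where F0 is the mean of the
    first `baseline_frames` samples (an assumed baseline definition)."""
    f0 = mean(trace[:baseline_frames])
    return [(f - f0) / f0 for f in trace]

# Two hypothetical cells with the same absolute change (delta F = 5)
# but very different baseline fluorescence.
dim_cell = [5.0, 5.0, 5.0, 5.0, 10.0]
bright_cell = [500.0, 500.0, 500.0, 500.0, 505.0]

dim_peak = max(delta_f_over_f0(dim_cell, baseline_frames=4))
bright_peak = max(delta_f_over_f0(bright_cell, baseline_frames=4))

print(f"dim cell peak dF/F0:    {dim_peak:.2f}")
print(f"bright cell peak dF/F0: {bright_peak:.2f}")
```

Both cells show the same absolute ΔF, yet the normalized responses differ by a factor of 100, which is why pooling absolute ΔF across cells or animals can be misleading.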
Overall, the authors selectively report statistical comparisons only for findings that support their claims, while most other potentially informative comparisons are omitted. Complete and transparent reporting is necessary for proper interpretation of the data.
Reviewer #2 (Public review):
Summary:
In this manuscript, Malfatti et al. study the role of Chrna2 Martinotti cells (Mα2 cells), a subset of SST interneurons, for motor learning and motor cortex activity. The authors trained mice on a forelimb prehension task while recording neuronal activity of pyramidal cells using calcium imaging with a head-mounted miniscope. While chemogenetically increasing Mα2 cell activity did not affect motor learning, it changed pyramidal cell activity such that activity peaks became sharper and differently timed than in control mice. Moreover, co-active neuronal assemblies become more stable with a smaller spatial distribution. Increasing Mα2 cell activity in previously trained mice did increase performance on the prehension task and led to increased theta and gamma band activity in the motor cortex. On the other hand, genetic ablation of Mα2 cells affected fine motor movements on a pasta handling task while not affecting the prehension task.
Strengths:
The proposed question of how Chrna2-expressing SST interneurons affect motor learning and motor cortex activity is important and timely. The study employs sophisticated approaches to record neuronal activity and manipulate the activity of a specific neuronal population in behaving mice over the course of motor learning. The authors analyze a variety of neuronal activity parameters, comparing different behavior trials, stages of learning, and the effects of Mα2 cell activation. The analysis of neuronal assembly activity and stability over the course of learning by tracking individual neurons throughout the imaging sessions is notable because it is technically challenging, and it yielded the interesting result that neuronal assemblies are more stable when activating Mα2 cells.
Overall, the study provides compelling evidence that Mα2 cells regulate certain aspects of motor behaviors, likely by shaping circuit activity in the motor cortex.
Weaknesses:
The main limitation of the study lies in its small sample sizes and the absence of key control experiments, which substantially weaken the strength of the conclusions.
Core findings of this paper, such as the lack of effect of Mα2 cell activation on motor learning, as well as the altered neuronal activity, rely on a sample size of n=3 mice per condition, which is likely underpowered to detect differences in behavior and contributes to the somewhat disconnected results on calcium activity, activity timing, and neuronal assembly activity.
More comprehensive analyses and data presentation are also needed to substantiate the results. For example, examining calcium activity and behavioral performance on a trial-by-trial basis could clarify whether closely spaced reaching attempts influence baseline signals and skew interpretation.
The study uses cre-negative mice as controls for hM3Dq-mediated activation, which does not account for potential effects of Cre-dependent viral expression that occur only in Cre-positive mice.
This important control would be necessary to substantiate the conclusion that it is increased Mα2 cell activity that drives the observed changes in behavior and cortical activity.
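The sample-size concern raised above (n=3 mice per condition) can be illustrated with a minimal sketch. It assumes an exact two-sided permutation test on the difference in group means, which is not necessarily the test used in the manuscript, and the data values are hypothetical. With three animals per group there are only 20 possible group assignments, so even perfectly separated groups cannot reach p < 0.05 under this test.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(indices):
        # Absolute difference in means when `indices` form group A.
        sum_a = sum(pooled[i] for i in indices)
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(range(n_a))
    splits = list(combinations(range(len(pooled)), n_a))
    # Count relabelings at least as extreme as the observed one
    # (small tolerance guards against floating-point ties).
    extreme = sum(1 for idx in splits if abs_mean_diff(idx) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical, perfectly separated data: the best case for n = 3 per group.
p_min = exact_permutation_p([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
print(f"smallest achievable two-sided p with n=3 per group: {p_min}")
```

Parametric tests can return smaller p-values at this sample size, but only by relying on distributional assumptions that three observations per group cannot verify.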
eLife Assessment
This valuable study shows that regions of the human auditory cortex that respond strongly to voices are also sensitive to vocalizations from closely related primate species. The study is methodologically solid, though additional analyses - particularly those isolating the acoustic features that differentiate chimpanzee from bonobo calls - would further strengthen the conclusions. With additional analyses and discussions, the work has the potential to offer key insights into the evolutionary continuity of voice processing and would be of interest to researchers studying auditory processing and evolutionary neuroscience in general.
Reviewer #1 (Public review):
Summary:
This study investigates how human temporal voice areas (TVA) respond to vocalizations from nonhuman primates. Using functional MRI during a species-categorization task, the authors compare neural responses to calls from humans, chimpanzees, bonobos, and macaques while modeling both acoustic and phylogenetic factors. They find that bilateral anterior TVA regions respond more strongly to chimpanzee than to other nonhuman primate vocalizations, suggesting that these regions are sensitive not only to human voices but also to acoustically and evolutionarily related sounds.
The work provides important comparative evidence for continuity in primate vocal communication and offers a strong empirical foundation for modeling how specific acoustic features drive TVA activity.
Strengths:
(1) Comparative scope: The inclusion of four primate species, including both great apes and monkeys, provides a rare and valuable cross-species perspective on voice processing.
(2) Methodological rigor: Acoustic and phylogenetic distances are carefully quantified and incorporated into the analyses.
(3) Neuroscientific significance: The finding of TVA sensitivity to chimpanzee calls supports the view that human voice-selective regions are evolutionarily tuned to certain acoustic features shared across primates.
(4) Clear presentation: The study is well organized, the stimuli well controlled, and the imaging analyses transparent and replicable.
(5) Theoretical contribution: The results advance understanding of the neural bases of voice perception and the evolutionary roots of voice sensitivity in the human brain.
Weaknesses:
(1) Acoustic-phylogenetic confound: The design does not fully disentangle acoustic similarity from phylogenetic proximity, as species co-vary along both dimensions. A promising way to address this would be to include an additional model focusing on the acoustic features that specifically differentiate bonobo from chimpanzee calls, which share equal phylogenetic distance to humans.
(2) Selectivity vs. sensitivity: Without non-vocal control sounds, the study cannot determine whether TVA responses reflect true selectivity for primate vocalizations or general auditory sensitivity.
(3) Task demands: The use of an active categorization task may engage additional cognitive processes beyond auditory perception; a passive listening condition would help clarify the contribution of attention and task performance.
(4) Figures and presentation: Some results are partially redundant; keeping only the most representative model figure in the main text and moving others to the Supplementary Material would improve clarity.
Reviewer #2 (Public review):
Summary:
This study investigated how the human brain responds to vocalizations from multiple primate species, including humans, chimpanzees, bonobos, and rhesus macaques. The central finding - that subregions of the temporal voice areas (TVA), particularly in the bilateral anterior superior temporal gyrus, show enhanced responses to chimpanzee vocalizations - suggests a potential neural sensitivity to calls from phylogenetically close nonhuman primates.
Strengths:
The authors employed three analytical models to consistently demonstrate activation in the anterior superior temporal gyrus that is specific to chimpanzee calls. The methodology was logical and robust, and the results supporting these findings appear solid.
Weakness:
The interpretation of the findings in this paper regarding the evolutionary continuity of voice processing lacks sufficient evidence. A simple explanation is that the observed effects can be attributed to the similarity in low-level acoustic features, rather than effects specific to phylogenetically close species. The authors only tested vocalizations from three non-human primate species, other than humans. In this case, the species specificity of the effect does not fully represent the specificity of evolutionary relatedness.
Reviewer #3 (Public review):
Summary:
Ceravolo et al. employed functional magnetic resonance imaging (fMRI) to examine how the temporal voice areas (TVA) in the human brain respond to vocalizations from different nonhuman primate species. Their findings reveal that the human TVA is not only responsible for human vocalizations but also exhibits sensitivity to the vocalizations of other primates, particularly chimpanzee vocalizations sharing acoustic similarities with human voices, which offers compelling evidence for cross-species vocal processing in the human auditory system. Overall, the study presents intellectually stimulating hypotheses and demonstrates methodological originality. However, the current findings are not yet solid enough to fully support the proposed claims, and the presentation could be enhanced for clarity and impact.
Strengths:
The study presents intellectually stimulating hypotheses and demonstrates methodological originality.
Weaknesses:
(1) The analysis of the fMRI data does not account for the participants' behavioral performance, specifically their reaction times (RTs) during the species categorization task.
(2) The figure organization/presentation requires significant revision to avoid confusion and redundancy.
eLife Assessment
This valuable simulation study proposes a new coarse-grained model to explain the effects of CpG methylation on nucleosome wrapping energy. The model accurately reproduces the all-atom molecular dynamics simulation data, and the evidence to support the claims in the paper is solid. This work will be of interest to researchers working on gene regulation, mechanisms of DNA methylation and effects of DNA methylation on nucleosome positioning.
Reviewer #1 (Public review):
In this manuscript, the authors used a coarse-grained DNA model (cgNA+) to explore how DNA sequences and CpG methylation/hydroxymethylation influence nucleosome wrapping energy and the probability density of optimal nucleosomal configuration. Their findings indicate that both methylated and hydroxymethylated cytosines lead to increased nucleosome wrapping energy. Additionally, the study demonstrates that methylation of CpG islands increases the probability of nucleosome formation.
The major strength of this method is that the model explicitly includes the phosphate group as DNA-histone binding site constraints, enhancing CG model accuracy and computational efficiency and allowing comprehensive calculations of DNA mechanical properties and deformation energies.
The revised version has addressed the concerns raised previously, significantly strengthening the study.
Reviewer #2 (Public review):
Summary:
This study uses a coarse-grained model for double stranded DNA, cgNA+, to assess nucleosome sequence affinity. cgNA+ coarse-grains DNA on the level of bases and accounts also explicitly for the positions of the backbone phosphates. It has been proven to reproduce all-atom MD data very accurately. It is also ideally suited to be incorporated into a nucleosome model because it is known that DNA is bound to the protein core of the nucleosome via the phosphates.
It is still unclear whether this harmonic model parametrized for unbound DNA is accurate enough to describe DNA inside the nucleosome. Previous models by other authors, using more coarse-grained models of DNA, have been rather successful in predicting base pair sequence dependent nucleosome behavior. This is at least the case as long as DNA shape is concerned whereas assessing the role of DNA bendability (something this paper focuses on) has been consistently challenging in all nucleosome models to my knowledge.
It is thus of major interest whether this more sophisticated model is also more successful in handling this issue. As far as I can tell the work is technically sound and properly accounts for not only the energy required in wrapping DNA but also entropic effects, namely the change in entropy that DNA experiences when going from the free state to the bound state. The authors make an approximation here which seems to me to be a reasonable first step.
Of interest is also that the authors have the parameters at hand to study the effect of methylation of CpG-steps. This is especially interesting as this allows one to study a scenario where changes in the physical properties of base pair steps via methylation might influence nucleosome positioning and stability in a cell-type specific way.
Overall, this is an important contribution to the questions of how sequence affects nucleosome positioning and affinity. The findings suggest that cgNA+ has something new to offer. But the problem is complex, also on the experimental side, so many questions remain open. Despite this, I highly recommend publication of this manuscript.
Strengths:
The authors use their state-of-the-art coarse grained DNA model which seems ideally suited to be applied to nucleosomes as it accounts explicitly for the backbone phosphates.
Weaknesses:
The authors introduce penalty coefficients c_i to avoid steric clashes between the two DNA turns in the nucleosome. This requires c_i-values that are so high that standard deviations in the fluctuations of the simulation are smaller than in the experiments.
Reviewer #3 (Public review):
Summary:
In this study, authors utilize biophysical modeling to investigate differences in free energies and nucleosomal configuration probability density of CpG islands and nonmethylated regions in the genome. Toward this goal, they develop and apply the cgNA+ coarse-grained model, an extension of their prior molecular modeling framework.
Strengths:
The study utilizes biophysical modeling to gain mechanistic insight into nucleosomal occupancy differences in CpG and nonmethylated regions in the genome.
Weaknesses:
Although the overall study is interesting, the manuscript needs more clarity in places. Moreover, the rationale and conclusion for some of the analyses are not well described.
Comments on revised version:
The authors have addressed my concerns.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public Review):
Summary:
In this manuscript, the authors used a coarse-grained DNA model (cgNA+) to explore how DNA sequences and CpG methylation/hydroxymethylation influence nucleosome wrapping energy and the probability density of optimal nucleosomal configuration. Their findings indicate that both methylated and hydroxymethylated cytosines lead to increased nucleosome wrapping energy. Additionally, the study demonstrates that methylation of CpG islands increases the probability of nucleosome formation.
Strengths:
The major strength of this method is that the model explicitly includes the phosphate group as DNA-histone binding site constraints, enhancing CG model accuracy and computational efficiency and allowing comprehensive calculations of DNA mechanical properties and deformation energies.
Weaknesses:
A significant limitation of this study is that the parameter sets for the methylated and hydroxymethylated CpG steps in the cgNA+ model are derived from all-atom molecular dynamics (MD) simulations that use previously established force field parameters for modified cytosines (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021). These parameters suggest that both methylated and hydroxymethylated cytosines increase DNA stiffness and nucleosome wrapping energy, which could predispose the coarse-grained model to replicate these findings. Notably, conflicting results from other all-atom MD simulations, such as those by Ngo T in Nat. Commun. 2016, show that hydroxymethylated cytosines increase DNA flexibility, contrary to methylated cytosines. If the cgNA+ model were trained on these latter parameters or other all-atom MD force fields, different conclusions might be obtained regarding the effects of methylation and hydroxymethylation on nucleosome formation.
Despite the training parameters of the cgNA+ model, the results presented in the manuscript indicate that methylated cytosines increase both DNA stiffness and nucleosome wrapping energy. However, when comparing nucleosome occupancy scores with predicted nucleosome wrapping energies and optimal configurations, the authors find that methylated CGIs exhibit higher nucleosome occupancies than unmethylated ones, which seems to contradict the expected relationship where increased stiffness should reduce nucleosome formation affinity. In the manuscript, the authors also admit that these conclusions “apparently runs counter to the (perhaps naive) intuition that high nucleosome forming affinity should arise for fragments with low wrapping energy”. Previous all-atom MD simulations (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021; Ngo T, et al. Nat. Commun. 2016) show that the stiffer DNA upon CpG methylation reduces the affinity of DNA to assemble into nucleosomes or destabilizes nucleosomes. Given these findings, the authors need to address and reconcile these seemingly contradictory results, as the influence of epigenetic modifications on DNA mechanical properties and nucleosome formation are critical aspects of their study.
Understanding the influence of sequence-dependent and epigenetic modifications of DNA on mechanical properties and nucleosome formation is crucial for comprehending various cellular processes. The authors’ study, focusing on these aspects, definitely will garner interest from the DNA methylation research community.
Training the cgNA+ model on alternative MD simulation datasets is certainly of interest to us. However, due to the significant computational cost, this remains a goal for future work. The relationship between nucleosome occupancy scores and nucleosome wrapping energy is still debated, as noted in our Discussion section. The conflicting results may reflect differences in experimental conditions and the contribution of cellular factors other than DNA mechanics to nucleosome formation in vivo. For instance, Pérez et al. (2012), Battistini et al. (2021), and Ngo et al. (2016) concluded that DNA methylation reduces nucleosome formation based on experiments with modified Widom 601 sequences. In contrast, the genome-wide methylation study by Collings and Anderson (2017) found the opposite effect. In our work, we also use whole-genome nucleosome occupancy data.
Comments on revised version:
The authors have addressed most of my comments and concerns regarding this manuscript.
Reviewer #2 (Public Review):
Summary:
This study uses a coarse-grained model for double stranded DNA, cgNA+, to assess nucleosome sequence affinity. cgNA+ coarse-grains DNA on the level of bases and accounts also explicitly for the positions of the backbone phosphates. It has been proven to reproduce all-atom MD data very accurately. It is also ideally suited to be incorporated into a nucleosome model because it is known that DNA is bound to the protein core of the nucleosome via the phosphates.
It is still unclear whether this harmonic model parametrized for unbound DNA is accurate enough to describe DNA inside the nucleosome. Previous models by other authors, using more coarse-grained models of DNA, have been rather successful in predicting base pair sequence dependent nucleosome behavior. This is at least the case as long as DNA shape is concerned whereas assessing the role of DNA bendability (something this paper focuses on) has been consistently challenging in all nucleosome models to my knowledge.
It is thus of major interest whether this more sophisticated model is also more successful in handling this issue. As far as I can tell the work is technically sound and properly accounts for not only the energy required in wrapping DNA but also entropic effects, namely the change in entropy that DNA experiences when going from the free state to the bound state. The authors make an approximation here which seems to me to be a reasonable first step.
Of interest is also that the authors have the parameters at hand to study the effect of methylation of CpG-steps. This is especially interesting as this allows one to study a scenario where changes in the physical properties of base pair steps via methylation might influence nucleosome positioning and stability in a cell-type specific way.
Overall, this is an important contribution to the questions of how sequence affects nucleosome positioning and affinity. The findings suggest that cgNA+ has something new to offer. But the problem is complex, also on the experimental side, so many questions remain open. Despite this, I highly recommend publication of this manuscript.
Strengths:
The authors use their state-of-the-art coarse grained DNA model which seems ideally suited to be applied to nucleosomes as it accounts explicitly for the backbone phosphates.
Weaknesses:
The authors introduce penalty coefficients c_i to avoid steric clashes between the two DNA turns in the nucleosome. This requires c_i-values that are so high that standard deviations in the fluctuations of the simulation are smaller than in the experiments.
Indeed, smaller c_i values lead to steric clashes between the two turns of DNA. A possible improvement of our optimisation method and a direction of future work would be adding a penalty which prevents steric clashes to the objective function. Then the c_i values could be reduced to have bigger fluctuations that are even closer to the experimental structures.
Reviewer #3 (Public Review):
Summary:
In this study, authors utilize biophysical modeling to investigate differences in free energies and nucleosomal configuration probability density of CpG islands and nonmethylated regions in the genome. Toward this goal, they develop and apply the cgNA+ coarse-grained model, an extension of their prior molecular modeling framework.
Strengths:
The study utilizes biophysical modeling to gain mechanistic insight into nucleosomal occupancy differences in CpG and nonmethylated regions in the genome.
Weaknesses:
Although the overall study is interesting, the manuscript needs more clarity in places. Moreover, the rationale and conclusion for some of the analyses are not well described.
We have revised the manuscript in accordance with the reviewer’s latest suggestions.
Comments on revised version:
Authors have attempted to address previously raised concerns.
Reviewer #1 (Recommendations for the authors):
The authors have addressed most of my comments and concerns regarding this manuscript. Among them, the most significant pertains to fitting the coarse-grained model using a different all-atom force field to verify the conclusions. The authors acknowledged this point but noted the computational cost involved and proposed it as a direction for future work. Overall, I recommend the revised version for publication.
Reviewer #2 (Recommendations for the authors):
My previous comments were addressed satisfactorily.
Reviewer #3 (Recommendations for the authors):
Authors have attempted to address previously raised concerns. However, some concerns listed below remain that need to be addressed.
(1) The first reviewer makes a valid point regarding the reconciliation of conflicting observations related to nucleosome-forming affinity and wrapping energy. Unfortunately, the authors don’t seem to address this and state that this will be the goal for the future study.
Training the cgNA+ model on alternative MD simulation datasets remains future work. However, we revised the Discussion section to more clearly address the conflicting experimental findings in the literature on how DNA methylation influences nucleosome formation.
(2) Please report the effect size and statistical significance value for Figures 7 and 8, as this information is currently not provided, despite the authors’ claim that these observations are statistically significant.
This information is now presented in Supplementary Tables S1-S4.
(3) In response to the discrepancy in cell lines for correlating nucleosome occupancy and methylation analyses, the authors claim that there is no publicly available nucleosome occupancy and methylation data for a human cell type within the human genome. This claim is confusing, as the GM12878 cell line has been extensively characterized with MNase-seq and WGBS.
We thank the reviewer for this remark. We have removed the statement regarding the lack of data from the manuscript; we intend to examine the suggested cell line in future research.
(4) In response to my question, the authors claimed that they selected regions from chromosome 1 exclusively; however, the observation remains unchanged when considering sequence samples from different genomic regions. They should provide examples from different chromosomes as part of the supplementary information to further support this.
The examples of corresponding plots for other nucleosomes are now shown in Supplementary Figure S9.
eLife Assessment
This useful study identifies knowledge of letter shape as a distinct component of letter knowledge and shows that children acquire it even before formal reading instruction and without knowing the corresponding letter sounds. However, the evidence supporting the main conclusions is incomplete at the current stage. With additional analyses examining the relationships among the underlying variables and/or revising interpretations, the work would be of broad interest to researchers studying language and vision.
Reviewer #1 (Public review):
Summary:
This study examines letter-shape knowledge in a large cohort of children with minimal formal reading instruction. The authors report that these children can reliably distinguish upright from inverted letters despite limited letter naming abilities. They also show a visual-search advantage for upright over inverted letters, and this advantage correlates with letter-shape familiarity. These findings suggest that specialized letter-shape representations can emerge with very limited letter-sound mapping practice.
Strengths:
This study investigates whether children can develop letter-shape knowledge independently of letter-sound mapping abilities. This question is theoretically important, especially in light of functional subdivisions within the visual word form area (VWFA), with posterior regions associated with letter/orthographic shape and anterior regions with linguistic features of orthography (Caffarra et al., 2021; Lerma-Usabiaga et al., 2018). The study also includes a large sample of children at the very beginning of formal reading instruction, thereby minimizing the influence of explicit instruction on the formation of letter-shape knowledge.
Weakness:
A central concern is that a production task (naming) is used to index letter-name knowledge, whereas letter-shape knowledge is assessed with recognition. Production tasks impose additional demands (motor planning, articulation) and typically yield lower performance than recognition tasks (e.g., letter-sound verification). Thus, comparisons between letter-shape and letter-name knowledge are confounded by task type. The authors' partial-correlation and multiple-regression analyses linking familiarity (but not production) to the upright-search advantage are informative; however, they do not resolve the recognition-versus-production mismatch. Consequently, the current data cannot unambiguously support the claim that letter-shape representations are independent of letter-name knowledge.
Reviewer #2 (Public review):
Summary:
In this study, the authors propose that there are two types of letter knowledge: knowledge about letter sound and knowledge about letter shape. Based on previous studies on implicit statistical learning in adults and babies, the authors hypothesized that passive exposure to letters in the environment allows early readers to acquire knowledge of letter shapes even before knowledge of letter-sound association. Children performed a set of experiments that measures letter shape familiarity, letter-sound association performance, visual processing of letters, and a reading-related cognitive skill. The results show that even the children who have little to no knowledge of letter names are familiar with letter shapes, and that this letter shape familiarity is predictive of performance in visual processing of letters.
Strengths:
The authors' hypothesis is based on widely accepted findings in vision science that repeated exposure to certain stimuli promotes implicit learning of, for example, statistical properties of the stimuli. They used simple and well-established tasks in large-scale experiments with a special population (i.e., children). The data analysis is quite comprehensive, accounting for alternative explanations when needed. The data support at least a part of their hypothesis that the knowledge of letter shapes is distinct from, and precedes, the knowledge of letter-sound association, and is associated with performance in visual processing of the letters. This study sheds light on a rather overlooked aspect of letter knowledge, i.e., letter shapes, challenging the idea that letters are learned only through formal instruction and calling for future research on the role of passive exposure to letters in reading acquisition.
Weaknesses:
Although the authors have successfully identified the knowledge of letter shapes as another type of letter knowledge other than the knowledge of letter-sound association, the question of whether it drives the subsequent reading acquisition remains largely unanswered, despite it being strongly implied in the Introduction. The authors collected a RAN score, which is known to robustly predict future reading fluency, but it did not show a significant partial correlation with familiarity accuracy (i.e., familiarity accuracy is not necessary to predict RAN score). The authors discussed that the performance in visual processing of letters might capture unique variance in reading fluency unexplained by RAN scores, but currently, this claim seems speculative.
Since even children without formal literacy instruction were highly familiar with letter shapes, it would be reasonable to assume that they had obtained the knowledge through passive exposure. However, the role of passive exposure was not directly tested in the study.
Given the superimposed straight lines in Figure 2, I assume the authors computed Pearson correlation coefficients. Testing the statistical significance of the Pearson correlation coefficient requires the assumption of bivariate normality (and therefore constant variance of a variable across the range of the other). According to Figure 2, this doesn't seem to be met, as the familiarity accuracy is hitting the ceiling. The ceiling effect might not be critical in Figure 2, since it tends to attenuate correlation, not inflate it. But in Figures 3 and 4, the authors' conclusion depends on the non-significant partial correlation. In fact, the authors themselves wrote that the ceiling effect might lead to a non-significant correlation even if there is an actual effect (line 404).
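The attenuation argument above can be checked with a minimal simulation sketch. Everything here is an assumption made for illustration: the latent correlation of 0.6, the ceiling placed at the median of the outcome, the sample size, and the variable names are hypothetical and are not taken from the manuscript's data. The sketch draws two correlated normal variables, caps one of them at a ceiling, and compares the Pearson coefficients before and after capping.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_ceiling(n=5000, rho=0.6, ceiling=0.0, seed=0):
    """Return (r without ceiling, r with the outcome capped at `ceiling`)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_sd = math.sqrt(1.0 - rho ** 2)
    # ys has unit variance and latent correlation rho with xs.
    ys = [rho * x + noise_sd * rng.gauss(0.0, 1.0) for x in xs]
    # Scores above the ceiling are recorded as the ceiling value.
    ys_capped = [min(y, ceiling) for y in ys]
    return pearson(xs, ys), pearson(xs, ys_capped)


r_latent, r_capped = simulate_ceiling()
print(f"r without ceiling: {r_latent:.3f}")
print(f"r with ceiling:    {r_capped:.3f}")
```

Under these assumptions the capped coefficient is smaller than the uncapped one, which is why a non-significant (partial) correlation involving a ceiling-limited measure is hard to interpret as evidence of no relationship.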
Reviewer #3 (Public review):
Summary:
This study examined how young children with minimal reading instruction process letters, focusing on their familiarity with letter shapes, knowledge of letter names, and visual discrimination of upright versus inverted letters. Across four experiments, kindergarten and Grade 1 children could identify the correct orientation of letters even without knowing their names.
Strengths:
This study addresses an important research gap by examining whether children develop letter familiarity prior to formal literacy instruction and how this skill relates to reading-related cognitive abilities. By emphasizing letter familiarity alongside letter recognition, the study highlights a potentially overlooked yet important component of emergent literacy development.
Weaknesses:
The study's methods and results do not effectively test its stated research goals. Reading ability was not directly measured; instead, the authors inferred the relationship between letter familiarity and reading from correlations between letter familiarity and reading-related cognitive measures, which limits the validity of their conclusions. Furthermore, the analytical approach was rather limited, relying primarily on simple and partial correlations without employing more advanced statistical methods that could better capture the underlying relationships.
Major Comments:
(1) Limited Novelty and Unclear Theoretical Contribution:
The authors aim to challenge the view that children acquire letter shape knowledge only through formal literacy instruction, but similar questions regarding letter familiarity have already been explored in previous research. The manuscript does not clearly articulate how the present study advances beyond existing findings or why examining letter familiarity specifically before formal instruction provides new theoretical insight. Moreover, if letter familiarity and letter recognition are treated as distinct constructs, the authors should better justify their differentiation and clarify the theoretical significance of focusing on familiarity as an independent component of emergent literacy.
(2) Overgeneralization to Reading Ability:
Although the study measured several literacy-related cognitive skills and examined correlations with letter familiarity, it did not directly assess children's reading ability, as participants had not yet received formal literacy instruction. Therefore, the conclusion that letter familiarity influences reading skills (e.g., Line 519: "Our results are broadly consistent with previous work that has highlighted print letter knowledge as a strong predictor of future reading skills") is not fully supported and should be clarified or revised. To draw conclusions about the impact on reading ability, a longitudinal study would be more appropriate, assessing the relationship between letter familiarity and reading skills after children have received formal literacy instruction. If a longitudinal study is not feasible, measuring familial risk for dyslexia could provide an alternative approach to infer the potential influence of letter familiarity on later reading development.
(3) Confusing and Limited Analytical Approach with Potential for More Sophisticated Modeling:
The study employs a confusing analytical approach, alternating between simple correlational analyses and group-based comparisons, which may introduce circularity - for example, defining high vs. low familiarity groups partly based on performance differences in upright versus inverted letters and then observing a visual search advantage for upright letters within these groups. Moreover, the analyses are relatively simple: although multiple linear regression is mentioned, the results are not fully reported. These approaches may not fully capture the complex relationships among letter familiarity, recognition, visual search performance, RAN, and other covariates. More sophisticated modeling, such as mixed-effects models to account for repeated measures, structural equation modeling to examine latent constructs, or multivariate approaches jointly modeling familiarity and recognition effects, could provide a clearer understanding of the unique contribution of letter shape familiarity to early literacy outcomes. In addition, a large number of correlations were conducted without correction for multiple comparisons, which may increase the risk of false positives and raise concerns about the reliability of some significant findings.
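The multiple-comparisons concern can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate across many correlation tests. The p-values below are hypothetical and are not taken from the manuscript; the function name is likewise illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the test survives FDR control at alpha."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from six correlation tests (illustration only).
p_values = [0.001, 0.012, 0.030, 0.041, 0.049, 0.20]
uncorrected = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values)
print(sum(uncorrected), "significant before correction;", sum(corrected), "after")
```

In this toy case, five tests fall below 0.05 before correction but only two survive it, which is the kind of shrinkage that could affect some of the reported findings.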
eLife Assessment
This important work develops a new protocol to experimentally perturb target genes across a quantitative range of expression levels in cell lines. The evidence supporting their new perturbation approach is convincing, and we propose that focusing on a single modality (activation or inhibition) would be sufficient to draw their conclusions. The study will be of broad interest to scientists in the fields of functional genomics and biotechnology.
Reviewer #1 (Public review):
In this manuscript, Domingo et al. present a novel perturbation-based approach to experimentally modulate the dosage of genes in cell lines. Their approach is capable of gradually increasing and decreasing gene expression. The authors then use their approach to perturb three key transcription factors and measure the downstream effects on gene expression. Their analysis of the dosage response curve of downstream genes reveals marked non-linearity.
One of the strengths of this study is that many of the perturbations fall within the physiological range for each cis gene. This range is presumably between a single-copy state of heterozygous loss-of-function (log fold change of -1) and a three-copy state (log fold change of ~0.6). This is in contrast with CRISPRi or CRISPRa studies that attempt to maximize the effect of the perturbation, which may result in downstream effects that are not representative of physiological responses.
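The two bounds quoted above follow from simple arithmetic, sketched here under two assumptions that are ours rather than the authors': a two-copy (diploid) baseline, and expression that scales proportionally with copy number. The function name is hypothetical.

```python
import math


def copy_number_log2fc(copies, baseline_copies=2):
    """log2 fold change in expression, assuming expression is proportional to copy number."""
    return math.log2(copies / baseline_copies)


# Heterozygous loss of function: one copy relative to two.
print(copy_number_log2fc(1))
# Three-copy state: three copies relative to two.
print(round(copy_number_log2fc(3), 3))
```

Under these assumptions the one-copy state gives exactly -1 and the three-copy state gives log2(1.5), roughly 0.585, matching the approximate range stated in the review.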
Another strength of the study is that various points along the dosage-response curve were assayed for each perturbed gene. This allowed the authors to effectively characterize the degree of linearity and monotonicity of each dosage-response relationship. Ultimately, the study revealed that many of these relationships are non-linear, and that the response to activation can be dramatically different than the response to inhibition.
To test their ability to gradually modulate dosage, the authors chose to measure three transcription factors and around 80 known downstream targets. As the authors themselves point out in their discussion about MYB, this biased sample of genes makes it unclear how this approach would generalize genome-wide. In addition, the data generated from this small sample of genes may not represent genome-wide patterns of dosage response. Nevertheless, this unique data set and approach represent a first step in understanding dosage-response relationships between genes.
Another point of general concern in such screens is the use of the immortalized K562 cell line. It is unclear how the biology of these cell lines translates to the in vivo biology of primary cells. However, the authors do follow up with cell-type-specific analyses (Figures 4B, 4C, and 5A) to draw correspondence between their perturbation results and the relevant biology in primary cells and complex diseases.
The conclusions of the study are generally well supported with statistical analysis throughout the manuscript. As an example, the authors utilize well-known model selection methods to identify when there was evidence for non-linear dosage response relationships.
Gradual modulation of gene dosage is a useful approach to model physiological variation in dosage. Experimental perturbation screens that use CRISPR inhibition or activation often use guide RNAs targeting the transcription start site to maximize their effect on gene expression. Generating a physiological range of variation will allow others to better model physiological conditions.
There is broad interest in the field to identify gene regulatory networks using experimental perturbation approaches. The data from this study provides a good resource for such analytical approaches, especially since both inhibition and activation were tested. In addition, these data provide a nuanced, continuous representation of the relationship between effectors and downstream targets, which may play a role in the development of more rigorous regulatory networks.
Human geneticists often focus on loss-of-function variants, which represent natural knock-down experiments, to determine the role of a gene in the biology of a trait. This study demonstrates that dosage response relationships are often non-linear, meaning that the effect of a loss-of-function variant may not necessarily carry information about increases in gene dosage. For the field, this implies that others should continue to focus on both inhibition and activation to fully characterize the relationship between gene and trait.
Comments on revisions:
Thank you for responding to our comments. We have no further comments for the authors.
Reviewer #2 (Public review):
Summary:
This work investigates transcriptional responses to varying levels of transcription factors (TFs). The authors aim for gradual up- and down-regulation of three transcription factors GFI1B, NFE2 and MYB in K562 cells, by using a CRISPRa- and a CRISPRi line, together with sgRNAs of varying potency. Targeted single-cell RNA sequencing is then used to measure gene expression of a set of 90 genes, which were previously shown to be downstream of GFI1B and NFE2 regulation. This is followed by an extensive computational analysis of the scRNA-seq dataset. By grouping cells with the same perturbations, the authors can obtain groups of cells with varying average TF expression levels. The achieved perturbations are generally subtle, not reaching half or double doses for most samples, and up-regulation is generally weak below 1.5-fold in most cases. Even in this small range, many target genes exhibit a non-linear response. Since this is rather unexpected, it is crucial to rule out technical reasons for these observations.
Strengths:
The work showcases how a single dataset of CRISPRi/a perturbations with scRNA-seq readout and an extended computational analysis can be used to estimate transcriptome dose-responses, a general approach that likely can be built upon in the future.
Moreover, the authors highlight tiling of sgRNAs +/- 1000 bp around the TSS as a useful approach. Compared with conventional direct TSS-targeting (+/- 200 bp), the larger sequence window allows placing more sgRNAs. It also requires little prior knowledge of CREs and avoids using "attenuated" sgRNAs, which would require specialized sgRNA design.
Weaknesses:
The experiment was performed in a single replicate, and it would have been reassuring to see an independent validation of the main findings, for example through measuring individual dose-response curves.
Much of the analysis depends on the estimation of log-fold changes between groups of single cells with non-targeting controls and those carrying a guide RNA driving a specific knockdown. Generally, biological replicates are recommended for differential gene expression testing (Squair et al. 2021, https://doi.org/10.1038/s41467-021-25960-2). When using the FindMarkers function from the Seurat package, the authors deviate from the recommendations for pseudo-bulk analysis to aggregate the raw counts (https://satijalab.org/seurat/articles/de_vignette.html). Furthermore, differential gene expression analysis of scRNA-seq data can suffer from mis-estimations (Nguyen et al. 2023, https://doi.org/10.1038/s41467-023-37126-3), and different computational tools or versions can affect these estimates strongly (Pullin et al. 2024, https://doi.org/10.1186/s13059-024-03183-0 and Rich et al. 2024, https://doi.org/10.1101/2024.04.04.588111). Therefore, it would be important to describe more precisely in the Methods how this analysis was performed, any deviations from default parameters, package versions, and at which point which values were aggregated to form "pseudobulk" samples.
Two different cell lines are used to construct dose-response curves, where a CRISPRi line allows gene down-regulation and the CRISPRa line allows gene upregulation. Although both lines are derived from the same parental line (K562), the expression analysis of Tet2, which is absent in the CRISPRi line but expressed in the CRISPRa line (Fig. S1F, S3A), suggests clonal differences between the two lines. Similarly, the UMAP in S3C and the PCA in S4A suggest batch effects between the two lines. These might confound this analysis, even though all fold changes are calculated relative to the baseline expression in the respective cell line (NTC cells). Combining log2-fold changes from the two cell lines with different baseline expression into a single curve (e.g. Fig. 3) remains misleading, because different data points could be normalized to different baseline expression levels.
The study estimates the relationship between TF dose and target gene expression. This requires a system that allows quantitative changes in TF expression. The data provided do not convincingly show that this condition is met, although it is an essential prerequisite for the presented conclusions. Specifically, Fig. S3A shows that upon stronger knock-down, a subpopulation of cells appears in which the targeted TF is no longer detected (drop-outs). Fig. 3B (top) also suggests that the knock-down is either subtle (similar to NTCs) or strong, but intermediate knock-down (log2-FC of 0.5-1) does not occur. Although the authors argue that this is a technical effect of the scRNA-seq protocol, it is also possible that this represents a binary behavior of the CRISPRi system. Previous work has shown that CRISPRi systems with the KRAB domain largely result in binary repression and not in gradual down-regulation as suggested in this study (Bintu et al. 2016 (https://doi.org/10.1126/science.aab2956), Noviello et al. 2023 (https://doi.org/10.1038/s41467-023-38909-4)).
One of the major conclusions of the study is that non-linear behavior is common. It would be helpful to show that this observation does not arise from the technical concerns described in the previous points. This could be done for instance with independent experimental validations.
Did the authors achieve their aims? Do the results support the conclusions?:
Some of the most important conclusions, such as the claim that non-linear responses are common, are not well supported because they rely on accurately determining the quantitative responses of trans genes, which suffers from the previously mentioned concerns.
Discussion of the likely impact of the work on the field, and the utility of the methods and data to the community:
Together with other recent publications, this work emphasizes the need to study transcription factor function with quantitative perturbations. The computational code repository contains all the valuable code with inline comments, but would have benefited from a readme file explaining the repository structure, package versions, and instructions to reproduce the analyses, including which input files or directory structure would be needed.
Author response:
The following is the authors’ response to the original reviews.
Reviewer #1 (Public review):
In this manuscript, Domingo et al. present a novel perturbation-based approach to experimentally modulate the dosage of genes in cell lines. Their approach is capable of gradually increasing and decreasing gene expression. The authors then use their approach to perturb three key transcription factors and measure the downstream effects on gene expression. Their analysis of the dosage response curve of downstream genes reveals marked non-linearity.
One of the strengths of this study is that many of the perturbations fall within the physiological range for each cis gene. This range is presumably between a single-copy state of heterozygous loss-of-function (log fold change of -1) and a three-copy state (log fold change of ~0.6). This is in contrast with CRISPRi or CRISPRa studies that attempt to maximize the effect of the perturbation, which may result in downstream effects that are not representative of physiological responses.
Another strength of the study is that various points along the dosage-response curve were assayed for each perturbed gene. This allowed the authors to effectively characterize the degree of linearity and monotonicity of each dosage-response relationship. Ultimately, the study revealed that many of these relationships are non-linear, and that the response to activation can be dramatically different than the response to inhibition.
To test their ability to gradually modulate dosage, the authors chose to measure three transcription factors and around 80 known downstream targets. As the authors themselves point out in their discussion about MYB, this biased sample of genes makes it unclear how this approach would generalize genome-wide. In addition, the data generated from this small sample of genes may not represent genome-wide patterns of dosage response. Nevertheless, this unique data set and approach represent a first step in understanding dosage-response relationships between genes.
Another point of general concern in such screens is the use of the immortalized K562 cell line. It is unclear how the biology of these cell lines translates to the in vivo biology of primary cells. However, the authors do follow up with cell-type-specific analyses (Figures 4B, 4C, and 5A) to draw a correspondence between their perturbation results and the relevant biology in primary cells and complex diseases.
The conclusions of the study are generally well supported with statistical analysis throughout the manuscript. As an example, the authors utilize well-known model selection methods to identify when there was evidence for non-linear dosage response relationships.
Gradual modulation of gene dosage is a useful approach to model physiological variation in dosage. Experimental perturbation screens that use CRISPR inhibition or activation often use guide RNAs targeting the transcription start site to maximize their effect on gene expression. Generating a physiological range of variation will allow others to better model physiological conditions.
There is broad interest in the field to identify gene regulatory networks using experimental perturbation approaches. The data from this study provides a good resource for such analytical approaches, especially since both inhibition and activation were tested. In addition, these data provide a nuanced, continuous representation of the relationship between effectors and downstream targets, which may play a role in the development of more rigorous regulatory networks.
Human geneticists often focus on loss-of-function variants, which represent natural knock-down experiments, to determine the role of a gene in the biology of a trait. This study demonstrates that dosage response relationships are often non-linear, meaning that the effect of a loss-of-function variant may not necessarily carry information about increases in gene dosage. For the field, this implies that others should continue to focus on both inhibition and activation to fully characterize the relationship between gene and trait.
We thank the reviewer for their thoughtful and thorough evaluation of our study. We appreciate their recognition of the strengths of our approach, particularly the ability to modulate gene dosage within a physiological range and to capture non-linear dosage-response relationships. We also agree with the reviewer’s points regarding the limitations of gene selection and the use of K562 cells, and we are encouraged that the reviewer found our follow-up analyses and statistical framework to be well-supported. We believe this work provides a valuable foundation for future genome-wide applications and more physiologically relevant perturbation studies.
Reviewer #2 (Public review):
Summary:
This work investigates transcriptional responses to varying levels of transcription factors (TFs). The authors aim for gradual up- and down-regulation of three transcription factors GFI1B, NFE2, and MYB in K562 cells, by using a CRISPRa- and a CRISPRi line, together with sgRNAs of varying potency. Targeted single-cell RNA sequencing is then used to measure gene expression of a set of 90 genes, which were previously shown to be downstream of GFI1B and NFE2 regulation. This is followed by an extensive computational analysis of the scRNA-seq dataset. By grouping cells with the same perturbations, the authors can obtain groups of cells with varying average TF expression levels. The achieved perturbations are generally subtle, not reaching half or double doses for most samples, and up-regulation is generally weak below 1.5-fold in most cases. Even in this small range, many target genes exhibit a non-linear response. Since this is rather unexpected, it is crucial to rule out technical reasons for these observations.
We thank the reviewer for their detailed and thoughtful assessment of our work. We are encouraged by their recognition of the strengths of our study, including the value of quantitative CRISPR-based perturbation coupled with single-cell transcriptomics, and its potential to inform gene regulatory network inference. Below, we address each of the concerns raised:
Strengths:
The work showcases how a single dataset of CRISPRi/a perturbations with scRNA-seq readout and an extended computational analysis can be used to estimate transcriptome dose responses, a general approach that likely can be built upon in the future.
Weaknesses:
(1) The experiment was only performed in a single replicate. In the absence of an independent validation of the main findings, the robustness of the observations remains unclear.
We acknowledge that our study was performed in a single pooled experiment. While additional replicates would certainly strengthen the findings, in high-throughput single-cell CRISPR screens, individual cells with the same perturbation serve as effective internal replicates. This is a common practice in the field. Nevertheless, we agree that biological replicates would help control for broader technical or environmental effects.
(2) The analysis is based on the calculation of log-fold changes between groups of single cells with non-targeting controls and those carrying a guide RNA driving a specific knockdown. How the fold changes were calculated exactly remains unclear, since it is only stated that the FindMarkers function from the Seurat package was used, which is likely not optimal for quantitative estimates. Furthermore, differential gene expression analysis of scRNA-seq data can suffer from data distortion and mis-estimations (Heumos et al. 2023 (https://doi.org/10.1038/s41576-023-00586-w), Nguyen et al. 2023 (https://doi.org/10.1038/s41467-023-37126-3)). In general, the pseudo-bulk approach used is suitable, but the correct treatment of drop-outs in the scRNA-seq analysis is essential.
We thank the reviewer for highlighting recent concerns in the field. A study benchmarking association testing methods for perturb-seq data found that among existing methods, Seurat’s FindMarkers function performed the best (T. Barry et al. 2024).
In the revised Methods, we now specify the formula used to calculate fold change and clarify that the estimates are derived from the Wilcoxon test implemented in Seurat’s FindMarkers function. We also employed pseudo-bulk grouping to mitigate single-cell noise and dropout effects.
(3) Two different cell lines are used to construct dose-response curves, where a CRISPRi line allows gene down-regulation and the CRISPRa line allows gene upregulation. Although both lines are derived from the same parental line (K562) the expression analysis of Tet2, which is absent in the CRISPRi line, but expressed in the CRISPRa line (Figure S3A) suggests substantial clonal differences between the two lines. Similarly, the PCA in S4A suggests strong batch effects between the two lines. These might confound this analysis.
We agree that baseline differences between CRISPRi and CRISPRa lines could introduce confounding effects if not appropriately controlled for. We emphasize that all comparisons are made as fold changes relative to non-targeting control (NTC) cells within each line, thereby controlling for batch- and clone-specific baseline expression. See figures S4A and S4B.
(4) The study uses pseudo-bulk analysis to estimate the relationship between TF dose and target gene expression. This requires a system that allows quantitative changes in TF expression. The data provided does not convincingly show that this condition is met, which however is an essential prerequisite for the presented conclusions. Specifically, the data shown in Figure S3A shows that upon stronger knock-down, a subpopulation of cells appears, where the targeted TF is not detected anymore (drop-outs). Also Figure 3B (top) suggests that the knock-down is either subtle (similar to NTCs) or strong, but intermediate knock-down (log2-FC of 0.5-1) does not occur. Although the authors argue that this is a technical effect of the scRNA-seq protocol, it is also possible that this represents a binary behavior of the CRISPRi system. Previous work has shown that CRISPRi systems with the KRAB domain largely result in binary repression and not in gradual down-regulation as suggested in this study (Bintu et al. 2016 (https://doi.org/10.1126/science.aab2956), Noviello et al. 2023 (https://doi.org/10.1038/s41467-023-38909-4)).
Figure S3A shows normalized expression values, not fold changes. A pseudobulk approach reduces single-cell noise and dropout effects. To test whether dropout events reflect true binary repression or technical effects, we compared trans-effects across cells with zero versus low-but-detectable target gene expression (Figure S3B). These effects were highly concordant, supporting the interpretation that dropout is largely technical in origin. We agree that KRAB-based repression can exhibit binary behavior in some contexts, but our data suggest that cells with intermediate repression exist and are biologically meaningful. In ongoing unpublished work, we pursue further analysis of these data at the single cell level, and show that for nearly all guides the dosage effects are indeed gradual rather than driven by binary effects across cells.
(5) One of the major conclusions of the study is that non-linear behavior is common. This is not surprising for gene up-regulation, since gene expression will reach a plateau at some point, but it is surprising to observe it for many genes upon TF down-regulation. Specifically, here the target gene responds to a small reduction of TF dose but shows the same response to a stronger knock-down. It would be essential to show that this observation does not arise from the technical concerns described in the previous point, and it would require independent experimental validations.
This phenomenon, where relatively small changes in cis gene dosage can produce trans effects that exceed the magnitude of the cis gene perturbation, is not unique to our study. It also makes biological sense, since transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many of the genes they regulate. Empirically, these effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022), to name a few studies whose data our lab has examined directly.
(6) One of the conclusions of the study is that guide tiling is superior to other methods such as sgRNA mismatches. However, the comparison is unfair, since different numbers of guides are used in the different approaches. Relatedly, the authors point out that tiling sometimes surpassed the effects of TSS-targeting sgRNAs, however, this was the least fair comparison (2 TSS vs 10 tiling guides) and additionally depends on the accurate annotation of TSS in the relevant cell line.
We do not draw this conclusion simply from observing the range achieved but from a more holistic observation. We would like to clarify that the number of sgRNAs used in each approach is proportional to the number of base pairs that can be targeted in each region: while the TSS-targeting strategy is typically constrained to a small window of a few dozen base pairs, tiling covers multiple kilobases upstream and downstream, resulting in more guides by design rather than by experimental bias. The guides with mismatches did not perform well for gradual upregulation.
We would also like to point out that the observation that the strongest effects can arise from regions outside the annotated TSS is not unique to our study and has been demonstrated in prior work (referenced in the text).
To address this concern, we have revised the text to clarify that we do not consider guide tiling to be inherently superior to other approaches such as sgRNA mismatches. Rather, we now describe tiling as a practical and straightforward strategy to obtain a wide range of gene dosage effects without requiring prior knowledge beyond the approximate location of the TSS. We believe this rephrasing more accurately reflects the intent and scope of our comparison.
(7) Did the authors achieve their aims? Do the results support the conclusions?: Some of the most important conclusions are not well supported because they rely on accurately determining the quantitative responses of trans genes, which suffers from the previously mentioned concerns.
We appreciate the reviewer’s concern, but we would have wished for a more detailed characterization of which conclusions are not supported, given that we believe our approach actually accounts for the major concerns raised above. We believe that the observation of non-linear effects is a robust conclusion that is also consistent with known biology, with this paper introducing new ways to analyze this phenomenon.
(8) Discussion of the likely impact of the work on the field, and the utility of the methods and data to the community:
Together with other recent publications, this work emphasizes the need to study transcription factor function with quantitative perturbations. Missing documentation of the computational code repository reduces the utility of the methods and data significantly.
Documentation is included as inline comments within the R code files to guide users through the analysis workflow.
Reviewer #1 (Recommendations for the authors):
In Figure 3C (and similar plots of dosage response curves throughout the manuscript), we initially misinterpreted the plots because we assumed that the zero log fold change on the horizontal axis was in the middle of the plot. This gives the incorrect interpretation that the trans genes are insensitive to loss of GFI1B in Figure 3C, for instance. We think it may be helpful to add a line to mark the zero log fold change point, as was done in Figure 3A.
We thank the reviewer for this helpful suggestion. To improve clarity, we have added a vertical line marking the zero log fold change point in Figure 3C and all similar dosage-response plots. We agree this makes the plots easier to interpret at a glance.
Similarly, for heatmaps in the style of Figure 3B, it may be nice to have a column for the non-targeting controls, which should be a white column between the perturbations that increase versus decrease GFI1B.
We appreciate the suggestion. However, because all perturbation effects are computed relative to the non-targeting control (NTC) cells, explicitly including a separate column for NTC in the heatmap would add limited interpretive value and could unnecessarily clutter the figure. For clarity, we have emphasized in the figure legend that the fold changes are relative to the NTC baseline.
We found it challenging to assess the degree of uncertainty in the estimation of log fold changes throughout the paper. For example, the authors state the following on line 190: "We observed substantial differences in the effects of the same guide on the CRISPRi and CRISPRa backgrounds, with no significant correlation between cis gene fold-changes." This claim was challenging to assess because there are no horizontal or vertical error bars on any of the points in Figure 2A. If the log fold change estimates are very noisy, the data could be consistent with noisy observations of a correlated underlying process. Similarly, to our understanding, the dosage response curves are fit assuming that the cis log fold changes are fixed. If there is excessive noise in the estimation of these log fold changes, it may bias the estimated curves. It may be helpful to give an idea of the amount of estimation error in the cis log fold changes.
We agree that assessing the uncertainty in log fold change estimates is important for interpreting both the lack of correlation between CRISPRi and CRISPRa effects (Figure 2A) and the robustness of the dosage-response modeling.
In response, we have now updated Figure 2A to include both vertical and horizontal error bars, representing the standard errors of the log2 fold-change estimates for each guide in the CRISPRi and CRISPRa conditions. These error estimates were computed based on the differential expression analysis performed using the FindMarkers function in Seurat, which models gene expression differences between perturbed and control cells. We also now clarify this in the figure legend and methods.
The authors mention hierarchical clustering on line 313, which identified six clusters. Although a dendrogram is provided, these clusters are not displayed in Figure 4A. We recommend displaying these clusters alongside the dendrogram.
We have added colored bars indicating the clusters to improve the clarity. Thank you for the suggestion.
In Figures 4B and 4C, it was not immediately clear what some of the gene annotations meant. For example, neither the text nor the figure legend discusses what "WBCs", "Platelets", "RBCs", or "Reticulocytes" mean. It would be helpful to include this somewhere other than only the methods to make the figure clearer.
To improve clarity, we have updated the figure legends for Figures 4B and 4C to explicitly define these abbreviations.
We struggled to interpret Figure 4E. Although the authors focus on the association of MYB with pHaplo, we would have appreciated some general discussion about the pattern of associations seen in the figure and what the authors expected to observe.
We have changed the paragraph to add more exposition and clarification:
“The link between selective constraint and response properties is most apparent in the MYB trans network. Specifically, the probability of haploinsufficiency (pHaplo) shows a significant negative correlation with the dynamic range of transcriptional responses (Figure 4G): genes under stronger constraint (higher pHaplo) display smaller dynamic ranges, indicating that dosage-sensitive genes are more tightly buffered against changes in MYB levels. This pattern was not reproduced in the other trans networks (Figure 4E)”.
Line 71: potentially incorrect use of "rending" and incorrect sentence grammar.
Fixed
Line 123: "co-expression correlation across co-expression clusters" - authors may not have intended to use "co-expression" twice.
Original sentence was correct.
Line 246: "correlations" is used twice in "correlations gene-specific correlations."
Fixed.
Reviewer #2 (Recommendations for the authors):
(1) To show that the approach indeed allows gradual down-regulation, it would be important to quantify the knock-down strength with a single-cell readout for a subset of sgRNAs individually (e.g. flowFISH/protein staining flow cytometry).
We agree that single-cell validation of knockdown strength using orthogonal approaches such as flowFISH or protein staining would provide additional support. However, such experiments fall outside the scope of the current study and are not feasible at this stage. We note that the observed transcriptomic changes and dosage responses across multiple perturbations are consistent with effective and graded modulation of gene expression.
(2) Similarly, an independent validation of the observed dose-response relationships, e.g. with individual sgRNAs, can be helpful to support the conclusions about non-linear responses.
Fig. S4C includes replication of trans-effects for a handful of guides used both in this study and in Morris et al. While further orthogonal validation of dose-response relationships would be valuable, such extensive additional work is not currently feasible within the scope of this study. Nonetheless, the high degree of replication in Fig. S4C as well as consistency of patterns observed across multiple sgRNAs and target genes provides strong support for the conclusions drawn from our high-throughput screen.
(3) The calculation of the log2 fold changes should be documented more precisely. To perform a pseudo-bulk analysis, the raw UMI counts should be summed up in each group (NTC, individual targeting sgRNAs), including zero counts, then the data should be normalized and the fold change should be calculated. The DESeq package for example would be useful here.
We have updated the methods in the manuscript to provide more exposition of how the logFC was calculated:
“In our differential expression (DE) analysis, we used Seurat’s FindMarkers() function, which computes the log fold change as the difference between the average normalized gene expression in each group on the natural log scale:
Logfc = log_e(mean(expression in group 1)) - log_e(mean(expression in group 2))
This is calculated in pseudobulk where cells with the same sgRNA are grouped together and the mean expression is compared to the mean expression of cells harbouring NTC guides. To calculate per-gene differential expression p-value between the two cell groups (cells with sgRNA vs cells with NTC), Wilcoxon Rank-Sum test was used”.
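The formula quoted above can be illustrated with a minimal sketch. This is not Seurat's implementation: the function name `log_fold_change` and the example values are hypothetical, and the sketch only restates the stated formula (difference of natural logs of the group means). Practical implementations often add a pseudocount before taking the logarithm, which is omitted here.

```python
import math
from statistics import fmean


def log_fold_change(group1, group2):
    """Natural-log fold change between the mean expression of two cell groups.

    Implements logFC = ln(mean(group1)) - ln(mean(group2)) as stated in the text.
    No pseudocount is added, so both means must be strictly positive.
    """
    mean1 = fmean(group1)
    mean2 = fmean(group2)
    if mean1 <= 0 or mean2 <= 0:
        raise ValueError("both group means must be positive to take a logarithm")
    return math.log(mean1) - math.log(mean2)


# Hypothetical normalized expression values for a single gene.
sgrna_cells = [0.5, 1.0, 1.5]  # cells carrying one targeting sgRNA
ntc_cells = [2.0, 2.0, 2.0]    # cells carrying non-targeting control (NTC) guides

lfc = log_fold_change(sgrna_cells, ntc_cells)

# Convert from the natural-log scale to log2 by dividing by ln(2).
log2_fc = lfc / math.log(2)
```

In this toy case the targeted group has half the mean expression of the control group, so the value is negative on either scale. Dividing by ln(2) is the only step needed to move between the natural-log scale described here and a log2 scale.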
(4) A more careful characterization of the cell lines used would be helpful. First, it would be useful to include the quality controls performed when the clonal lines were selected, in the manuscript. Moreover, a transcriptome analysis in comparison to the parental cell line could be performed to show that the cell lines are comparable. In addition, it could be helpful to perform the analysis of the samples separately to see how many of the response behaviors would still be observed.
Details of the quality control steps used during the selection of the CRISPRa clonal line are already included in the Methods section, and Fig. S4A shows the transcriptome comparison of CRISPRi and CRISPRa lines also for non-targeting guides. Regarding the transcriptomic comparison with the parental cell line, we agree that such an analysis would be informative; however, this would require additional experiments that are not feasible within the scope of the current study. Finally, while analyzing the samples separately could provide further insight into response heterogeneity, we focused on identifying robust patterns across perturbations that are reproducible in our pooled screening framework. We believe these aggregate analyses capture the major response behaviors and support the conclusions drawn.
(5) In general we were surprised to see such strong responses in some of the trans genes, in some cases exceeding the fold changes of the cis gene perturbation more than 2x, even at the relatively modest cis gene perturbations (Figures S5-S8). How can this be explained?
This phenomenon—where trans gene responses can exceed the magnitude of cis gene perturbations—is not unique to our study. Similar effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022).
Several factors may contribute to this pattern. One possibility is that certain trans genes are highly sensitive to transcription factor dosage, and therefore exhibit amplified expression changes in response to relatively modest upstream perturbations. Transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many other genes (that are regulated by TFs). Mechanistically, this may involve non-linear signal propagation through regulatory networks, in which intermediate regulators or feedback loops amplify the downstream transcriptional response. While our dataset cannot fully disentangle these indirect effects, the consistency of this observation across multiple studies suggests it is a common feature of transcriptional regulation in K562 cells.
(6) In the analysis shown in Figure S3B, the correlation between cells with zero count and >0 counts for the cis gene is calculated. For comparison, this analysis should also show the correlation between the cells with similar cis-gene expression and between truly different populations (e.g. NTC vs strong sgRNA).
The intent of Figure S3B was not to compare biologically distinct populations or perform differential expression analyses—which we have already conducted and reported elsewhere in the manuscript—but rather to assess whether fold change estimates could be biased by differences in the baseline expression of the target gene across individual cells. Specifically, we sought to determine whether cells with zero versus non-zero expression (as can result from dropouts or binary on/off repression from the KRAB-based CRISPRi system) exhibit systematic differences that could distort fold change estimation. As such, the comparisons suggested by the reviewer do not directly relate to the goal of the analysis which Figure S3B was intended to show.
(7) It is unclear why the correlation between different lanes is assessed as quality control metrics in Figure S1C. This does not substitute for replicates.
The intent of Figure S1C was not to serve as a general quality control metric, but rather to illustrate that the targeted transcript capture approach yielded consistent and specific signal across lanes. We acknowledge that this may have been unclear and have revised the relevant sentence in the text to avoid misinterpretation.
“We used the protein hashes and the dCas9 cDNA (indicating the presence or absence of the KRAB domain) to demultiplex and determine the cell line—CRISPRi or CRISPRa. Cells containing a single sgRNA were identified using a Gaussian mixture model (see Methods). Standard quality control procedures were applied to the scRNA-seq data (see Methods). To confirm that the targeted transcript capture approach worked as intended, we assessed concordance across capture lanes (Figure S1C)”.
(8) Figures and legends often miss important information. Figure 3B and S5-S8: what do the transparent bars represent? Figure S1A: color bar label missing. Figure S4D: what are the lines? Figure S9A: what is the red line? In Figure S8 some of the fitted curves do not overlap with the data points, e.g. PKM. Fig. 2C: why are there more than 96 guide RNAs (see y-axis)?
We have addressed each point as follows:
Figure 3B: The figure legend has been updated to clarify the meaning of the transparent bars.
Figures S5–S8: There are no transparent bars in these figures; we confirmed this in the source plots.
Figure S1A: The color bar label is already described in the figure legend, but we have reformulated the caption text to make this clearer.
Figure S4D: The dashed line represents a linear regression between the x and y variables. The figure caption has been updated accordingly.
Figure S9A: We clarified that the red line shows the median ∆AIC across all genes and conditions.
Figure S8: We agree that some fitted curves (e.g., PKM) do not closely follow the data points. This reflects high noise in these specific measurements; as noted in the text, TET2 is not expected to exert strong trans effects in this context.
Figure 2C: Thank you for catching this. The y-axis numbers were incorrect because the figure displays the proportion of guides (summing to 100%), not raw counts. We have corrected the y-axis label and updated the numbers in the figure to resolve this inconsistency.
(9) The code is deposited on Github, but documentation is missing.
Documentation is included as inline comments within the R code files to guide users through the analysis workflow.
(10) The methods miss a list of sgRNA target sequences.
We thank the reviewer for this observation. A complete table containing all processed data, including the sequences of the sgRNAs used in this study, is available at the following GEO link:
(11) In some parts, the language could be more specific and/or the readability improved, for example:
Line 88: "quantitative landscape".
Changed to “quantitative patterns”.
Lines 88-91: long sentence hard to read.
This complex sentence was broken up into two simpler ones:
“We uncovered quantitative patterns of how gradual changes in transcription dosage lead to linear and non-linear responses in downstream genes. Many downstream genes are associated with rare and complex diseases, with potential effects on cellular phenotypes”.
Line 110: "tiling sgRNAs +/- 1000 bp from the TSS", could maybe be specified by adding that the average distance was around 100 or 110 bps?
Lines 244-246: hard to understand.
We struggle to see the issue here and are not sure how it can be reworded.
Lines 339-342: hard to understand.
These sentences have been reworded to provide more clarity.
(12) A number of typos, and errors are found in the manuscript:
Line 71: "SOX2" -> "SOX9".
FIXED
Line 73: "rending" -> maybe "raising" or "posing"?
FIXED
Line 157: "biassed".
FIXED
Line 245: "exhibited correlations gene-specific correlations with".
FIXED
Multiple instances, e.g. 261: "transgene" -> "trans gene".
FIXED
Line 332: "not reproduced with among the other".
FIXED
Figure S11: betweenness.
This is the correct spelling
There are more typos that we didn't list here.
We went through the manuscript and corrected all the spelling errors and typos.
eLife Assessment
This study presents a valuable tool named TSvelo, a computational framework for RNA velocity inference that models transcriptional regulation and gene-specific splicing. The evidence supporting the claims of the authors is solid, although elaboration of the computational benchmark and datasets would have strengthened the study. The work will be of interest to computational scientists working in the field of RNA biology.
Reviewer #1 (Public review):
Summary:
In the paper, the authors propose a new RNA velocity method, TSvelo, which predicts the transcription rate linearly based on the RNA expression levels of transcription factors. This framework extends the authors' recent work, TFvelo, by including unspliced reads and designing a coherent neuralODE framework. Improved performance was demonstrated in six diverse datasets.
Strengths:
Overall, this method introduces innovative solutions to link cell differentiation and gene regulation, with a balance between model complexity (neuralODE) and interpretability (raw gene space).
Weaknesses:
While it seems to provide convincing results, there are multiple technical concerns for the authors to clarify and double-check.
(1) The authors should clarify and discuss the TF-target map: here, the TF-target genes map is predefined by the TF binding's ChIP-seq data. This annotation is largely incomplete and mostly compiled from a set of bulk tissues. Therefore, for a certain population, the TF-target relation may change. This requires clarification and discussion, possibly exploring how to address this in the model. In addition, a regulon database could be added, e.g., DoRothEA?
(2) The authors should clarify how example genes are selected. This is particularly unclear in Figure 2d.
(3) The authors should clarify confidence in the statement in lines 179-180, that ANXA4 should initially decrease. This is particularly concerning, as TSvelo didn't capture the cell cycle transitions well during the initial part.
(4) A supporting reference should be added for the statement in line 260 that "neuron migrations are inside-out manner". There is no reference supporting this, and this statement is critical for the model assessment.
(5) The comparison to scMultiomics data is particularly interesting, as MultiVelo uses ATAC data to predict the transcription rate. It would be very insightful to add a direct comparison of the estimated transcription rate between using ATAC and directly using TFs' RNA expressions.
(6) In Figure 6g, it should be clarified how the lineage was determined. Did the authors use the LARRY barcodes, predicted cell fate, or any other methods? Here, the best way is probably using the LARRY barcodes for individual clones.
Reviewer #2 (Public review):
Summary:
Li et al. propose TSvelo, a computational framework for RNA velocity inference that models transcriptional regulation and gene-specific splicing using a neural ODE approach. The method is intended to improve trajectory reconstruction and capture dynamic gene expression changes in scRNA-seq data. However, the manuscript in its current form falls short in several critical areas, including rigorous validation, quantitative benchmarking, clarity of definitions, proper use of prior knowledge, and interpretive caution. Many of the authors' claims are not fully supported by the evidence.
Major comments:
(1) Modeling comments
(a) Lines 512-513: How does the U-to-S delay validate the accuracy of pseudotime? Using only a single gene as an example is not sufficient for "validation."
(b) Lines 512-518: The authors propose a strategy for selecting the initial state, but do not benchmark how accurate this selection procedure is, nor do they provide sufficient rationale. While some genes may indeed exhibit U-to-S delay during lineage differentiation, why does the highest U-to-S delay score indicate the correct initiation states? Please provide mathematical justification and demonstrate accuracy beyond using a single gene example. Maybe a simulation with ground truth could help here, too.
(c) Equation (8): The formulation looks to be incorrect. If $$W \in \mathbb{R}^{G\times G}$$ and $$W' - \Gamma' \in \mathbb{R}^{K\times K}$$, how can they be aligned within the same row? Please clarify.
(d) The use of prior knowledge graphs from ENCODE or ChEA to constrain regulation raises concerns. Much of the regulatory information in these databases comes from cell lines. How can such cell-line-based regulation be reliably applied to primary tissues, as is done throughout the manuscript? Additional experiments are needed to test the robustness of TSvelo with respect to prior knowledge.
(e) Lines 579-580: How is the grid search performed? More methodological details are required. If an existing method was used, please provide a citation.
(2) Application on pancreatic endocrine datasets
(a) Lines 140-141: What is the definition of the final pseudotime-fitted time t or velocity pseudotime?
(b) Lines 143-144: The use of the velocity consistency metric to benchmark methods in multi-lineage datasets is incorrect. In multi-lineage differentiation systems, cells (e.g., those in fate priming stages) may inherently show inconsistency in their velocity. Thus, it is difficult to distinguish inconsistency caused by estimation error from that arising from biological signals. Velocity consistency metrics are only appropriate in systems with unidirectional trajectories (e.g., cell cycling). The abnormally high consistency values here raise concerns about whether the estimated velocities meaningfully capture lineage differences.
(c) The improvement of TSvelo over other methods in terms of cross-boundary direction correctness looks marginal; a statistical test would help to assess its significance.
(d) Lines 177-178: Based on the figure, TSvelo does not appear to clearly distinguish cell types. A quantitative metric, such as Adjusted Rand Index (ARI), should be provided.
(e) Lines 179-183: The claim that traditional methods cannot capture dynamics in the unspliced-spliced phase portrait is vague. What specific aspect is not captured: the fitted values or something else? Evidence is lacking. Please provide a detailed explanation and quantitative metrics to support this claim.
(3) Application to gastrulation erythroid datasets
(a) Lines 191-194: The observation that velocity genes are enriched for erythropoiesis-related pathways is trivial, since the analysis is restricted to highly variable genes (HVGs) from an erythropoiesis dataset. This enrichment is expected and therefore not informative.
(b) Lines 227-228: It remains unclear how TSvelo "accurately captures the dynamics." What is the definition of dynamics in this context? Figure 3g shows unspliced/spliced vs. fitted time plots and phase portraits, but without a quantitative definition or measure, the claim of superiority cannot be supported. Visualization of a single gene is insufficient; a systematic and quantitative analysis is needed.
(4) Application to the mouse brain and other datasets
(a) Lines 280-281: The authors cannot claim that velocity streams are smoother in TSvelo than in MultiVelo based solely on 2D visualization. Similarly, claiming that one model predicts the correct differentiation trajectory from a 2D projection is over-interpretation, as has been discussed in prior literature (see PMID: 37885016).
(b) Lines 304-306: Beyond transcriptional signal estimation, how is regulation inferred solely from scRNA-seq data validated, especially compared with scATAC-seq data? Are there cases where transcriptome-based regulatory inference is supported by epigenomic evidence, thereby demonstrating TSvelo's GRN inference accuracy?
(c) The claim that TSvelo can model multi-lineage datasets hinges on its use of PAGA for lineage segmentation, followed by independent modeling of dynamics within each subset. However, the procedure for merging results across subsets remains unclear.
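As an illustration of the quantitative metric requested in comment (2d), the Adjusted Rand Index can be computed from two label vectors with the standard pair-counting formula. The sketch below is a generic implementation, not part of TSvelo; the function name and example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two clusterings of the same cells.

    Uses the pair-counting form:
    ARI = (index - expected_index) / (max_index - expected_index).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Contingency counts and marginal counts of the two partitions.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs that share a cluster in both / either partition.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case: both partitions are trivial and identical.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical annotated cell types versus clusters derived from a velocity embedding.
annotated = ["alpha", "alpha", "beta", "beta"]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

ari_same = adjusted_rand_index(annotated, same_up_to_renaming)
ari_crossed = adjusted_rand_index(annotated, crossed)
```

The index is invariant to how clusters are named, equals 1 for identical partitions, is close to 0 for random agreement, and can be negative when agreement is worse than chance, as in the second example.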
Reviewer #3 (Public review):
Despite the abundance of RNA velocity tools, there are still major limitations, and there is strong skepticism about the results these methods lead to. In this paper, the authors try to address some limitations of current RNA velocity approaches by proposing a unified framework to jointly infer transcriptional and splicing dynamics. The method is then benchmarked on 6 real datasets against the most popular RNA velocity tools.
While the approach has the potential to be of interest for the field, and may present improvements compared to existing approaches, there are some major limitations that should be addressed, particularly concerning the benchmark (see major comment 1).
Major comments:
(1) My main criticism concerns the benchmarking: real data lack a ground truth, and are absolutely not ideal for comparing methods, because one can only speculate what results appear to be more plausible. A solid and extensive simulation study, which covers various scenarios and possibly distinct data-generating models, is needed for comparing approaches. The authors should check, for example, the simulation studies in the BayVel approach (Section 4, BayVel: A Bayesian Framework for RNA Velocity Estimation in Single-Cell Transcriptomics). Clearly, all methods should be included in the simulation.
(2) Related to the above: since a ground truth is missing, the real data analyses need to be interpreted with caution. I recommend avoiding strong statements, such as "successfully captures the correct gene dynamics", or "accurately infer", in favour of milder statements supported by the data, such as "... aligns with the biological processes described" (as in page 12), or "results are compatible with current biological knowledge", etc...
(3) Many methods perform RNA velocity analyses. While there is a brief description, I think it'd be useful to have a schematic summary (e.g., via a Table) of the main conceptual, mathematical, and computational characteristics of each approach.
(4) Related to the above: I struggled to identify the main conceptual novelty of TSvelo, compared to existing approaches. I recommend explaining this aspect more extensively.
(5) A computational benchmark is missing; I'd appreciate seeing the runtime and memory cost of all methods in a couple of datasets.
(6) I think BayVel (mentioned above) should be added to the list of competing methods (both in the text and in the benchmarks). The package can be found here: https://github.com/elenasabbioni/BayVel_pkgJulia .
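The computational benchmark requested in comment (5) could be collected with a small harness along the following lines. This is only a sketch under stated assumptions: `benchmark` and `toy_workload` are hypothetical names, the workload is a stand-in for a call into any of the velocity tools, and `tracemalloc` may miss memory allocated outside Python's own allocator (for example by compiled extensions).

```python
import time
import tracemalloc


def benchmark(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Record timing and peak memory even if func raises.
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak_bytes


def toy_workload(n):
    """Placeholder for a method under test; builds a list so memory use is visible."""
    return sum([i * i for i in range(n)])


result, seconds, peak = benchmark(toy_workload, 10_000)
print(f"runtime: {seconds:.4f} s, peak traced memory: {peak / 1024:.1f} KiB")
```

Running each method several times on the same datasets and reporting the median would reduce the influence of one-off system noise on such measurements.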
Author response:
Reviewer #1:
We appreciate the reviewer’s positive assessment of TSvelo and their helpful technical comments. In the revised manuscript, we will:
(1) Provide a clearer discussion of TF–target annotations, their limitations, and potential integration of additional databases.
(2) Clarify the rationale for example-gene selection (e.g., in Fig. 2d).
(3) Re-evaluate and temper the interpretation regarding ANXA4 and early-stage cell-cycle transitions.
(4) Add appropriate references supporting neuronal inside-out migration.
(5) Include additional analysis comparing TF-based transcription rate estimation with ATAC-based estimates from MultiVelo.
(6) Clarify how lineages were determined in Fig. 6g and incorporate barcode-based validation where applicable.
(7) Correct all typographical errors noted.
Reviewer #2:
We appreciate the reviewer’s careful examination of modeling, benchmarking, and interpretation. To address these concerns, we will:
(1) Expand the methodological justification for initial-state selection, add simulations with ground truth, and evaluate U-to-S delay more broadly across genes.
(2) Clarify matrix formulations and ensure consistency in notation (e.g., Eq. 8).
(3) Assess robustness to prior-knowledge graphs and evaluate alternatives beyond ENCODE/ChEA.
(4) Add methodological details on parameter search.
(5) Improve benchmarking on pancreatic endocrine datasets by including clear definitions of velocity pseudotime, ARI for cell-type separation, quantitative evaluation of phase-portrait fits, and appropriate interpretation of consistency metrics for multi-lineage systems.
(6) Reframe claims about “accurate” or “correct” predictions where evidence is qualitative and strengthen quantitative support where possible.
(8) Clarify lineage segmentation and merging when applying PAGA-guided multi-lineage modeling.
Reviewer #3:
We thank the reviewer for highlighting the need for more rigorous benchmarking and conceptual clarity. In response, we will:
(1) Conduct an expanded simulation study incorporating different data-generating models.
(2) Revise all strong claims to more cautious, evidence-based language.
(3) Add a concise table summarizing conceptual and computational differences among RNA-velocity frameworks.
(4) More clearly articulate the conceptual novelty of TSvelo relative to existing approaches.
(5) Include runtime and memory benchmarks across representative datasets.
(6) Explore additional methods in conceptual comparisons and benchmarking analyses.
We appreciate the reviewers’ thoughtful input and agree that the suggested analyses and clarifications will significantly improve the rigor and clarity of the manuscript. We will incorporate all recommended revisions in the resubmission and provide a full, detailed, point-by-point response at that time.
eLife Assessment
This valuable study investigates the role of P-bodies in yeast proliferation and mRNA regulation within the phyllosphere, proposing that P-body assembly contributes to methanol metabolism and stress adaptation. The findings are of interest to researchers studying post-transcriptional gene regulation and microbial ecology in plants. However, the evidence is incomplete, as most experiments were performed under artificial conditions, relied on limited genetic validation, and were supported primarily by qualitative or low-resolution imaging.
Reviewer #1 (Public review):
Summary:
Stemming from the previous research on the adaptation of methylotrophic microbes in the phyllosphere environment, this paper tested a novel hypothesis on the molecular and cellular mechanisms by which yeast uses biomolecular condensates as unique niches for the regulation of methanol-induced mRNAs. While a few in vivo experiments were conducted in the phyllosphere, more assays were carried out on plates to mimic various stress conditions, diminishing the reliability of the conclusions in supporting the main hypothesis.
Strengths:
This study addressed an interesting and important biological question. Some of the experiments were conducted methodically and carefully. The visualization of both the biomolecular condensates and the mRNAs was helpful in addressing the questions. The results are expected to be useful in paving the way for the future study to directly test its main hypothesis. The results of this study could also have a general implication for the adaptation of a huge population of microbes in the enormous space of the phyllosphere on Earth.
Weaknesses:
The results were often over- and misinterpreted. Given that many hypotheses were tested indirectly on plates, the correlative results could only be used to carefully suggest the likelihood of the hypotheses. For example, a single edc3 mutant was used to represent a P-body-defective strain, although it is well known that EDC3 is a critical component in mRNA decapping; hence, the mutant should display a pleiotropic phenotype, rather than a mere reduced P-body phenotype. Using a similar reductionist approach, the study went on to employ a series of plate assays to argue that the conditions were mimicking the phyllosphere, which could be misleading under these circumstances. Furthermore, the low percentage of the colocalization between P-bodies and mimRNA granules and the similar results from negative control mRNAs do not convincingly support the idea that mimRNAs are sequestered between two biomolecular condensates, and P-bodies could serve as regulatory hubs. Given that the abundance of mimRNA granules was positively correlated with the transcript abundance of mimRNAs, and P-body abundance did not change too much under methanol induction, the results could not support an active mimRNA sequestration mechanism from mimRNA granules to P-bodies with a proportional increase of the overlap between the two condensates. More direct experiments conducted in the phyllosphere using multiple P-body defective yeast strains should strengthen the manuscript, assuming all the results turned out to be supportive.
Reviewer #2 (Public review):
Summary:
This article aims to elucidate the potential roles of P-bodies in yeast adaptation to complex environmental conditions, such as the plant leaf phyllosphere. The authors demonstrated that yeast mutants defective in one of the P-body-localized proteins failed to grow in the Arabidopsis thaliana phyllosphere. They conducted detailed imaging analyses, focusing particularly on the co-localization of P-bodies and mRNAs (DAS1) related to the methanol metabolism pathway under various environmental conditions. The study newly revealed that these mRNAs form dot-like structures that occasionally co-localize with a P-body marker. Furthermore, the authors showed that the number of P-body-labeled dots increases under stress conditions, such as H₂O₂ treatment, and that mRNA dots are more frequently localized to P-body-like structures. Based on these detailed observations, the authors hypothesize that P-bodies function to protect mRNAs from degradation, particularly under stress conditions.
Strengths:
I think the authors' attempt to elucidate the potential roles of P-bodies in yeast under stress conditions is novel, and the imaging data are overall very nice.
Weaknesses:
I believe the authors could make additional efforts to more clearly demonstrate that P-bodies are indeed required for yeast proliferation in the phyllosphere, as described below, since this represents the most novel aspect of the study.
Reviewer #3 (Public review):
Summary:
The authors use fluorescent microscopy and fluorescent markers to investigate the requirement of P-bodies during growth on methanol, a common substrate available on plant leaves, by using a yeast edc3 mutant defective in P-body formation. Growth on methanol upregulates the transcription of methanol metabolic genes, which accumulate in granular structures, as observed by microscopy. Co-localization of P-bodies and granules was quantified and described as dynamically enhanced during oxidative stress. Ultimately, the authors suggest a model where methanol induces the accumulation of methanol-induced mRNAs in cytosolic granules, which dynamically interact with P-bodies, especially during oxidative stress, to protect the mRNAs from degradation. However, this model is not strongly supported by the provided data, as the quantification of the co-localization between different markers (of organelles and between P-body and granules) is not well presented or described in the text.
Considering that there is only a small EDC3-dependent overlap between P-bodies and mimRNA granules, the claim that P-bodies regulate mimRNAs is not fully justified. Rather, EDC3 could also be involved in mimRNA granule formation, independent of P-bodies.
Strengths:
(1) The authors could show convincingly that P-bodies (using a P-body-deficient edc3-KO strain) are important for colonizing the plant phyllosphere and for the regulation of methanol-induced mRNAs (mimRNA).
(2) The visualization of mimRNA granules and P-bodies using fluorescent markers is interesting and was validated by alternative methods, such as FISH staining.
(3) The dynamic formation of mimRNA granules and P-bodies was demonstrated during growth on leaves and in artificial medium during oxidative stress. The mimRNA granules showed a similar dynamic as the abundances of several mimRNAs and their corresponding proteins.
(4) A role of EDC3 in the formation of mimRNA granules was demonstrated. However, the link between P-bodies and mimRNA granules was not clearly shown.
Weaknesses:
(1) The study largely relies on fluorescent microscopy and co-localization measurements. However, the subcellular resolution is not very high; it is unclear how dot-like structures were measured and, importantly, how co-localization was quantified.
(2) The text does not clarify to what degree P-bodies and mimRNA granules are different structures. Based on the images, the size of P-bodies and granules seems to be vastly different, making it unclear whether these structures are fused or separate, even if their markers are reported to overlap.
(3) The evidence that mimRNA granules contain ribosome-free and ribosome-associated RNA is only based on inhibitors and microscopy, without providing further evidence measuring granule content by isolation and sequencing approaches.
(4) Similarly, the co-localization with other organelle markers is not supported by quantitative data.
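To make the request for an explicit co-localization measure in points (1) and (4) concrete, one common object-based approach is to report the fraction of detected granules whose centroid lies within a fixed distance of any P-body centroid. The sketch below is an assumption for illustration only and does not describe how the authors quantified overlap; the function name, coordinates, and distance threshold are all hypothetical.

```python
from math import dist


def colocalized_fraction(granules, p_bodies, max_distance):
    """Fraction of granule centroids within max_distance of any P-body centroid.

    granules and p_bodies are sequences of (x, y) coordinates in the same units
    as max_distance (for example micrometres).
    """
    if not granules:
        return 0.0
    hits = sum(
        1
        for granule in granules
        if any(dist(granule, p_body) <= max_distance for p_body in p_bodies)
    )
    return hits / len(granules)


# Hypothetical centroid coordinates from one segmented cell image.
granules = [(1.0, 1.0), (5.0, 5.0), (9.0, 2.0), (3.0, 8.0)]
p_bodies = [(1.2, 1.1), (8.0, 8.0)]

fraction = colocalized_fraction(granules, p_bodies, max_distance=0.5)
```

Reporting such a fraction per cell, together with the same measure for a negative-control mRNA and for randomized positions, would allow the observed overlap to be compared against chance.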
eLife Assessment
This fundamental study presents experimental evidence on how geomagnetic and visual cues are integrated in a nocturnally migrating insect. The evidence supporting the conclusions is compelling. The work will be of broad interest to researchers studying animal migration and navigation.
Reviewer #1 (Public review):
Summary
The manuscript by Ma et al. provides robust and novel evidence that the noctuid moth Spodoptera frugiperda (Fall Armyworm) possesses a complex compass mechanism for seasonal migration that integrates visual horizon cues with Earth's magnetic field (likely its horizontal component). This is an important and timely study: apart from the Bogong moth, no other nocturnal Lepidoptera has yet been shown to rely on such a dual-compass system. The research therefore expands our understanding of magnetic orientation in insects with both theoretical (evolution and sensory biology) and applied (agricultural pest management, a new model of magnetoreception) significance.
The study uses state-of-the-art methods and presents convincing behavioural evidence for a multimodal compass. It also establishes the Fall Armyworm as a tractable new insect model for exploring the sensory mechanisms of magnetoreception, given the experimental challenges of working with migratory birds. Overall, the experiments are well-designed, the analyses are appropriate, and the conclusions are generally well supported by the data.
Strengths
(1) Novelty and significance: First strong demonstration of a magnetic-visual compass in a globally relevant migratory moth species, extending previous findings from the Bogong moth and opening new research avenues in comparative magnetoreception.
(2) Methodological robustness: Use of validated and sophisticated behavioural paradigms and magnetic manipulations consistent with best practices in the field. The use of 5-minute bins to study the dynamic nature of the magnetic compass, which is anchored to a visual cue but updated with a latency of several minutes, yields an important finding and is a new methodological aspect in insect orientation studies.
(3) Clarity of experimental logic: The cue-conflict and visual cue manipulations are conceptually sound and capable of addressing clear mechanistic questions.
(4) Ecological and applied relevance: Results have implications for understanding migration in an invasive agricultural pest with an expanding global range.
(5) Potential model system: Provides a new, experimentally accessible species for dissecting the sensory and neural bases of magnetic orientation.
Weaknesses
While the study is strong overall, several recommendations should be addressed to improve clarity, contextualisation, and reproducibility:
(1) Structure and presentation of results
The visual-cue experiments should be reordered to move from simpler (no cues) to more complex (cue-conflict) conditions, improving narrative logic and accessibility for non-specialists.
(2) Ecological interpretation
(a) The authors should discuss how their highly simplified, static cue setup translates to natural migratory conditions where landmarks are dynamic, transient or absent.
(b) Further consideration is required regarding how the compass might function when landmarks shift position, are obscured, or are replaced by celestial cues. Also, more consolidated (one section) and concrete suggestions for future experiments are needed, with transient, multiple, or more naturalistic visual cues to address this.
(3) Methodological details and reproducibility
(a) It would be better to move critical information (e.g., electromagnetic noise measurements) from the supplementary material into the main Methods.
(b) Specifying luminance levels and spectral composition at the moth's eye is required for all visual treatments.
(c) Details are needed on the sex ratio/reproductive status of tested moths, and a map of the experimental site and migratory routes (spring vs. fall) should be included.
(d) Expanding on activity-level analyses is required, replacing "fatigue" with "reduced flight activity," and clarifying if such analyses were performed.
(4) Figures and data presentation
(a) The font sizes on circular plots should be increased; compass labels (magnetic North), sample sizes, and p-values should be included.
(b) More clarity is required on what "no visual cue" conditions entail, and schematics or photos should be provided.
(c) The figure legends should be adjusted for readability and consistency (e.g., replace "magnetic South" with "magnetic North"; for box plots, use asterisks to mark significance and report confidence intervals).
(5) Conceptual framing and discussion
(a) Generalisations across species should be toned down, given the small number of systems tested by overlapping author groups.
(b) It should be highlighted that, unlike some vertebrates, moths require both magnetic and visual cues for orientation.
(c) It should be emphasised that this study addresses direction finding rather than full navigation.
(d) Future Directions should be integrated and consolidated into one coherent subsection proposing realistic next steps (e.g., more complex visual environments, temporal adaptation to cue-field relationships).
(e) The limitations arising from the artificiality of the visual cue should be discussed more fully, and earlier in the Discussion.
(6) Technical and open-science points
• Appropriate circular statistics should be used instead of t-tests for angular data shown in the supplementary material.
• Details should be provided on light intensities, power supplies, and improvements to the apparatus.
• The derivation of individual r-values should be clarified.
• Share R code openly (e.g., GitHub).
• Some highly relevant recent citations are missing and should be added, and some less relevant ones removed.
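To illustrate the point above about circular statistics and r-values, here is a minimal sketch in Python (NumPy only). It is not the authors' analysis; the function names are hypothetical, and the p-value uses a standard closed-form approximation for the Rayleigh test. It shows why an arithmetic mean or a t-test on angles is misleading: headings of 350° and 10° average to 0° on the circle, not 180°.

```python
import numpy as np

def mean_vector(angles_deg):
    """Return (mean direction in degrees in [0, 360), mean vector length r)."""
    theta = np.radians(np.asarray(angles_deg, dtype=float))
    c = np.cos(theta).mean()
    s = np.sin(theta).mean()
    r = np.hypot(c, s)                      # r = 0: no directedness, r = 1: all headings identical
    mean_dir = np.degrees(np.arctan2(s, c)) % 360.0
    return mean_dir, r

def rayleigh_test(angles_deg):
    """Rayleigh test of uniformity; returns (Z statistic, approximate p-value)."""
    n = len(angles_deg)
    _, r = mean_vector(angles_deg)
    big_r = n * r                           # resultant vector length
    z = n * r ** 2
    # Closed-form approximation to the Rayleigh p-value (an assumption of this sketch)
    p = np.exp(np.sqrt(1 + 4 * n + 4 * (n ** 2 - big_r ** 2)) - (1 + 2 * n))
    return z, min(float(p), 1.0)

if __name__ == "__main__":
    direction, r = mean_vector([350, 10])
    print(f"mean direction = {direction:.1f} deg, r = {r:.3f}")
```

The same mean_vector function could be applied per individual (to obtain an individual r-value from one moth's headings) or per group (to the individual mean directions), which is the distinction the derivation of r-values should make explicit.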
Reviewer #2 (Public review):
Summary:
This work provides experimental evidence on how geomagnetic and visual cues are integrated, and shows that visual cues are indispensable for magnetic orientation in the nocturnal fall armyworm.
Strengths:
Although it has been demonstrated previously that the Australian Bogong moth could integrate global stellar cues with the geomagnetic field for long-distance navigation, the study presented in this manuscript is still fundamentally important to the field of magnetoreception and sensory biology. It clearly shows that the integration of geomagnetic and visual cues may represent a conserved navigational mechanism broadly employed across migratory insects. I find the research very important, and the results are presented very well.
Weaknesses:
The authors developed an indoor experimental system to study the influence of magnetic fields and visual cues on insect orientation, which is certainly a valuable approach for this field. However, the ecological relevance of the visual cue may be limited or unclear based on the current version. The visual cues were provided "by a black isosceles triangle (10 cm high, 10 cm base) made from black wallpaper and fixed to the horizon at the bottom of the arena". It is difficult to conceive how such a stimulus (intended to represent a landmark like a mountain) could provide directional information for LONG-DISTANCE navigation in nocturnal fall armyworms, particularly given that these insects would have no prior memory of this specific landmark. It might be a good idea to provide a more detailed explanation of this point.
eLife Assessment
This important work introduces a family of interpretable Gaussian process models that allows us to learn and model sequence-function relationships in biomolecules. These models are applied to three recent empirical fitness landscapes, providing convincing evidence of their predictive power. The findings should be of interest to the community working on the sequence-function relationship, on epistasis, and on fitness landscapes.
Reviewer #1 (Public review):
Summary:
Zhou and colleagues introduce a series of generalized Gaussian process models for genotype-phenotype mapping. The goal was to develop models that were more powerful than standard linear models, while retaining explanatory power as opposed to neural network approaches. The novelty stems from choices of prior distributions (and I suppose fitted posteriors) that model epistasis based on some form of site/allele-specific modifier effect and genotype distance. The authors then apply their models to three empirical datasets, the GB1 antibody-binding dataset, the human 5' splice site dataset, and a yeast meiotic cross dataset, and find substantially improved variance explained while retaining strong explanatory power when compared to linear models.
Strengths:
The main strength of the manuscript lies in the development of the modeling approaches, as well as the evidence from the empirical dataset that the variance explained is improved.
Weaknesses:
The main weakness of the paper is that none of the models were tested on an in silico dataset where the ground truth is known. Therefore, it is unclear if their model actually retains any explanatory power.
Impact:
Genotype-phenotype mapping is a central problem in genetics. However, the function is complex and unknown. Simple linear models can uncover some functional link between genes and their effects, but do so through severe oversimplification of the system. On the other hand, neural networks can, in principle, model the function perfectly, but they do so without easy interpretation. Gaussian process regression is another approach that improves on linear regression, allowing better fitting of the data while allowing interpretation of the underlying alleles and their effects. This approach, now computable with state-of-the-art algorithms, will advance the field of genotype-to-phenotype associations.
Reviewer #2 (Public review):
This paper builds on prior work by some of the same authors on how to model fitness landscapes in the presence of epistasis. They have previously shown how simply writing general expansions of fitness in terms of one-body plus two-body plus three-body, etc., terms often fails to generalize to good predictions. They have also previously introduced a Gaussian process regression approach regarding how much epistasis there should be of each order.
This paper contains several main advances:
(1) They implement a more efficient form of the Gaussian process model fitting that uses GPUs and related algorithmic advances to enable better fitting of these models to datasets for larger sequences.
(2) They provide a software package implementing the above.
(3) They generalize the models to allow the extent of epistasis associated with changes in sequence to depend on specific sites, alleles, and mutations.
(4) They show modest improvements in prediction and substantial improvements in interpretability with the more generalized models above.
Overall, while this paper is quite technical, my assessment is that it represents a substantial conceptual and algorithmic advance for the above reasons, and I would recommend only modest revisions. The paper seems well-written and clear, given the inherent complexity of this topic.
Reviewer #3 (Public review):
Summary:
The authors propose three types of Gaussian process kernels that extend and generalize standard kernels used for sequence-function prediction tasks, giving rise to the connectedness, Jenga, and general product models. The associated hyperparameters are interpretable and represent epistatic effects of varying complexity. The proposed models significantly outperform the simpler baselines, including the additive model, pairwise interaction model, and Gaussian process with a geometric kernel, in terms of R^2.
Strengths:
(1) The demonstrated performance boost and improved scaling with increasing training data are compelling.
(2) The hyperparameter selection step using the marginal likelihood, as implemented by the authors, seems to yield a reasonable hyperparameter combination that lends itself to biologically plausible interpretations.
(3) The proposed kernels generalize existing kernels in domain-interpretable ways, and can correspond to cases that would not be "physical" in the original models (e.g., $\mu_p>1$ in the original connectedness model that allows modeling of anticorrelated phenotypes).
Weaknesses:
(1) While enabling uncertainty quantification is a key advantage of Gaussian processes, the authors do not present metrics specific to the predicted uncertainties; all metrics seem to concern the mean predictions only. It would be helpful to evaluate coverage metrics and maybe include an application of the uncertainties, such as in active learning or Bayesian optimization.
(2) The more complex models, like the general product model, place a heavier burden on the hyperparameter selection step. Explicitly discussing the optimization routine used here would be helpful to potential users of the method and code.
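As a concrete illustration of the coverage metric suggested in point (1), a minimal Python sketch follows. It assumes only that the model returns a Gaussian predictive mean and standard deviation for each held-out genotype; the function name and arguments are hypothetical and do not refer to the authors' software package.

```python
from statistics import NormalDist

import numpy as np

def empirical_coverage(y_true, pred_mean, pred_std, level=0.95):
    """Fraction of held-out values inside the central `level` predictive interval."""
    # Half-width of the central interval in units of the predictive standard deviation
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    y_true = np.asarray(y_true, dtype=float)
    pred_mean = np.asarray(pred_mean, dtype=float)
    pred_std = np.asarray(pred_std, dtype=float)
    inside = np.abs(y_true - pred_mean) <= z * pred_std
    return float(inside.mean())

if __name__ == "__main__":
    y = [-3.0, -1.0, 0.0, 1.0, 3.0]
    print(empirical_coverage(y, np.zeros(5), np.ones(5), level=0.95))
```

For well-calibrated uncertainties, the empirical coverage should be close to the nominal level across a range of levels (for example 50%, 80%, 95%); values well below the nominal level would indicate overconfident predictions.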
eLife Assessment
This important study describes a novel Bayesian psychophysical approach that efficiently measures how well humans can discriminate between colors across the entire isoluminant plane. The evidence was considered compelling, as it included successful model validation against hold-out data and published datasets. This approach could prove to be of use to color vision scientists, as well as to those who use computational psychophysics and attempt to model perceptual stimulus fields with smooth variations over coordinate spaces.
Reviewer #1 (Public review):
Summary:
This paper presents an ambitious and technically impressive attempt to map how well humans can discriminate between colours across the entire isoluminant plane. The authors introduce a novel Wishart Process Psychophysical Model (WPPM) - a Bayesian method that estimates how visual noise varies across colour space. Using an adaptive sampling procedure, they then obtain a dense set of discrimination thresholds from relatively few trials, producing a smooth, continuous map of perceptual sensitivity. They validate their procedure by comparing actual and predicted thresholds at an independent set of sample points. The work is a valuable contribution to computational psychophysics and offers a promising framework for modelling other perceptual stimulus fields more generally.
Strengths:
The approach is elegant and well-described (I learned a lot!), and the data are of high quality. The writing throughout is clear, and the figures are clean (elegant in fact) and do a good job of explaining how the analysis was performed. The whole paper is tremendously thorough, and the technical appendices and attention to detail are impressive (for example, a huge amount of data about calibration, variability of the stim system over time, etc). This should be a touchstone for other papers that use calibrated colour stimuli.
Weaknesses:
Overall, the paper works as a general validation of the WPPM approach. Importantly, the authors validate the model for the particular stimuli that they use by testing model predictions against novel sample locations that were not part of the fitting procedure (Figure 2). The agreement is pretty good, and there is no overall bias (perhaps local bias?), but they do note a statistically-significant deviation in the shape of the threshold ellipses. The data also deviate significantly from historical measurements, and I think the paper would be considerably stronger with additional analyses to test the generality of its conclusions and to make clearer how they connect with classical colour vision research. In particular, three points could use some extra work:
(1) Smoothness prior. The WPPM assumes that perceptual noise changes smoothly across colour space, but the degree of smoothness (the eta parameter) must affect the results. I did not see an analysis of its effects - it seems to be fixed at 0.5 (line 650). The authors claim that because the confidence intervals of the MOCS and the model thresholds overlap (line 223), the smoothing is not a problem, but this might just be because the thresholds are noisy. A systematic analysis varying this parameter (or at least testing a few other values), and reporting both predictive accuracy and anisotropy magnitude, would clarify whether the model's smoothness assumption is permitting or suppressing genuine structure in the data. Is the gamma parameter also similarly important? In particular, does changing the underlying smoothness constraint alter the systematic deviation between the model and the MOCS thresholds? The authors have thought about this (of course! - line 224), but also note a discrepancy (line 238). I also wonder if it would be possible to do some analysis on the posterior, which might also show if there are some regions of color space where this matters more than others? The reason for doing this is, in part, motivated by the third point below - it's not clear how well the fits here agree with historical data.
(2) Comparison with simpler models. It would help to see whether the full WPPM is genuinely required. Clearly, the data (both here and from historical papers) require some sort of anisotropy in the fitting - the sensitivities decrease as the stimuli move away from the adaptation point. But it's >not< clear how much the fits benefit from the full parameterisation used here. Perhaps fits for a small hierarchy of simpler models - starting with isotropic Gaussian noise (as a sort of 'null baseline') and progressing to a few low-dimensional variants - would reveal how much predictive power is gained by adding spatially varying anisotropy. This would demonstrate that the model's complexity is justified by the data.
(3) Quantitative comparison to historical data. The paper currently compares its results to MacAdam, Krauskopf & Gegenfurtner, and Danilova & Mollon only by visual inspection. It is hard to extract and scale actual data from historical papers, but from the quality of the plotting here, it looks like the authors have achieved this, and so quantitative comparisons are possible. The MacAdam data comparisons are pretty interesting - in particular, the orientations of the long axes of the threshold ellipses do not really seem to line up between the two datasets - and I thought that the orientation of those ellipses was a critical feature of the MacAdam data. Quantitative comparisons (perhaps overall correlations, which should be immune to scaling issues, axis-ratio, orientation, or RMS differences) would give concrete measures of the quality of the model. I know the authors spend a lot of time comparing to the CIE data, and this is great. But re-expressing the fitted thresholds in CIE or DKL coordinates, and comparing them directly with classical datasets, would make the paper's claims of "agreement" much more convincing.
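To make the suggestion in point (3) concrete, here is a minimal Python sketch of two scale-free ellipse summaries that could be compared across datasets: axis ratio and major-axis orientation. It assumes each threshold ellipse is summarised by a 2x2 symmetric positive-definite matrix expressed in a common colour space; the function names are hypothetical and not taken from the manuscript's code.

```python
import numpy as np

def ellipse_shape(cov):
    """Return (axis ratio >= 1, major-axis orientation in degrees in [0, 180))."""
    cov = np.asarray(cov, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    major = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    orientation = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    axis_ratio = np.sqrt(eigvals[-1] / eigvals[0])
    return axis_ratio, orientation

def orientation_difference(a_deg, b_deg):
    """Smallest angle between two axes; axes are equivalent modulo 180 degrees."""
    d = abs(a_deg - b_deg) % 180.0
    return min(d, 180.0 - d)

if __name__ == "__main__":
    ratio, ori = ellipse_shape([[4.0, 0.0], [0.0, 1.0]])
    print(f"axis ratio = {ratio:.2f}, orientation = {ori:.1f} deg")
    print(orientation_difference(170.0, 10.0))
```

Because both quantities are invariant to overall scaling of the thresholds, they sidestep the difficulty of matching absolute threshold sizes between studies that used different stimuli and criteria.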
Overall, this is a creative and technically sophisticated paper that will be of broad interest to vision scientists. It is probably already a definitive methods paper showing how we can sample sensitivity accurately across colour space (and other visual stimulus spaces). But I think that until the comparison with historical datasets is made clear (and, for example, how the optimal smoothness parameters are estimated), it has slightly less to tell us about human colour vision. This might actually be fine - perhaps we just need the methods?
Related to this, I'd also note that the authors chose a very non-standard stimulus to perform these measurements with (a rendered 3D 'Greebley' blob). This does have the advantage of some sort of ecological validity. But it has the significant >disadvantage< that it is unlike all the other (much simpler) stimuli that have been used in the past - and this is likely to be one of the reasons why the current (fitted) data do not seem to sit in very good agreement with historical measurements.
Reviewer #2 (Public review):
Summary:
Hong et al. present a new method that uses a Wishart process to dramatically increase the efficiency of measuring visual sensitivity as a function of stimulus parameters for stimuli that vary in a multidimensional space. Importantly, they have validated their model against their own hold-out data and against 3 published datasets, as well as against colour spaces aimed at 'perceptual uniformity' by equating JNDs. Their model achieves high predictive success and could be usefully applied in colour vision science and psychophysics more generally, and to tackle analogous problems in neuroscience featuring smooth variation over coordinate spaces.
Strengths:
(1) This research makes a substantial contribution by providing a new method to very significantly increase the efficiency with which inferences about visual sensitivity can be drawn, so much so that it will open up new research avenues that were previously not feasible. Secondly, the methods are well thought out and unusually robust. The authors made a lot of effort to validate their model, but also to put their results in the context of existing results on colour discrimination, transforming their results to present them in the same colour spaces as used by previous authors to allow direct comparisons. Hold-out validation is a great way to test the model, and this has been done for an unusually large number of observers (by the standards of colour discrimination research). Thirdly, they make their code and materials freely available with the intention of supporting progress and innovation. These tools are likely to be widely used in vision science, and could of course be used to address analogous problems for other sensory modalities and beyond.
Weaknesses:
It would be nice to better understand what constraints the choice of basis functions puts on the space of possible solutions. More generally, could there be particular features of colour discrimination (e.g., rapid changes near the white point) that the model captures less well? The substantial individual differences evident in Figure S20 (comparison with Krauskopf and Gegenfurtner, 1992) are interesting in this context. Some observers show radial biases for the discrimination ellipses away from the white point, some show biases along the negative diagonal (with major axes oriented parallel to the blue-yellow axis), and others show a mixture of the two biases. Are these genuine individual differences, or could the model be performing less accurately in this desaturated region of colour space?
Reviewer #3 (Public review):
Summary:
This study presents a powerful and rigorous approach for characterizing stimulus discriminability throughout a sensory manifold, and is applied to the specific context of predicting color discrimination thresholds across the chromatic plane.
Strengths:
Color discrimination has played a fundamental role in studies of human color vision and for color applications, but as the authors note, it remains poorly characterized. The study leverages the assumption that thresholds should vary smoothly and systematically within the space, and validates this with their own tests and comparisons with previous studies.
Weaknesses:
The paper assumes that threshold variations are due to changes in the level of intrinsic noise at different stimulus levels. However, it's not clear to me why they could not also be explained by nonlinearities in the responses, with fixed noise. Indeed, most accounts of contrast coding (which the study is at least in part measuring because the presentation kept the adapt point close to the gray background chromaticity, and thus measured increment thresholds), assume a nonlinear contrast response function, which can at least as easily explain why the thresholds were higher for colors farther from the gray point. It would be very helpful if a section could be added that explains why noise differences rather than signal differences are assumed and how these could be distinguished. If they cannot, then it would be better to allow for both and refer to the variation in terms of S/N rather than N alone.
Related to this point, the authors note that the thresholds should depend on a number of additional factors, including the spatial and temporal properties and the state of adaptation. However, many of these again seem to be more likely to affect the signal than the noise.
An advantage of the approach is that it makes no assumptions about the underlying mechanisms. However, the choice to sample only within the equiluminant plane is itself a mechanistic assumption, and these could potentially be leveraged for deciding how to sample to improve the characterization and efficiency. For example, given what we know about early color coding, would it be more (or less) efficient to select samples based on a DKL space, etc?
eLife Assessment
This valuable study demonstrates that self-motion strongly affects neural responses to visual stimuli, comparing humans moving through a virtual environment to passive viewing. However, evidence that the modulation is due to prediction is incomplete as it stands, since participants may come to expect visual freezes over the course of the experiment. This study bridges human and rodent studies on the role of prediction in sensory processing, and is therefore expected to be of interest to a large community of neuroscientists.
Reviewer #1 (Public review):
In this paper, the authors wished to determine human visuomotor mismatch responses in EEG in a VR setting. Participants were required to walk around a virtual corridor, where a mismatch was created by halting the display for 0.5s. This occurred every 10-15 seconds. They observe an occipital mismatch signal at 180 ms. They determine the specificity of this signal to visuomotor mismatch by subsequently playing back the same recording passively. They also show qualitatively that the mismatch response is larger than one generated in a standard auditory oddball paradigm. They conclude that humans therefore exhibit visuomotor mismatch responses like mice, and that this may provide an especially powerful paradigm for studying prediction error more generally.
Asking about the role of visuomotor prediction in sensory processing is of fundamental importance to understanding perception and action control, but I wasn't entirely sure what to conclude from the present paradigm or findings. Visuomotor prediction did not appear to have been functionally isolated. I hope the comments below are helpful.
(1) First, isolating visuomotor prediction by contrasting against a condition where the same video stream is played back subsequently does not seem to isolate visuomotor prediction. This condition always comes second, and therefore, predictability (rather than specifically visuomotor predictability) differs. Participants can learn to expect these screen freezes every 10-15 s, even precisely where they are in the session, and this will reduce the prediction error across time. Therefore, the smaller response in the passive condition may be partly explained by such learning. It's impossible to fully remove this confound, because the authors currently play back the visual specifics from the visuomotor condition, but given that the visuomotor correspondences are otherwise pretty stable, they could have an additional control condition where someone else's visual trace is played back instead of their own, and order counterbalanced. Learning that the freezes occur every 10-15 s, or even precisely where they occur, therefore, could not explain condition differences. At a minimum, it would be nice to see the traces for the first and second half of each session to see the extent to which the mismatch response gets smaller. This won't control for learning about the specific separations of the freezes, but it's a step up from the current information.
(2) Second, the authors admirably modified their visual-only condition to remove nausea from 6 DOF of movement (3D position, pitch, yaw, and roll). However, although it is far from ideal to have nauseous participants, it would appear from the figures that these modifications may have changed the responses (despite some pairwise lack of significance with small N). Specifically, the trace in S3 (6DOF) and 2E look similar - i.e., comparing the visuomotor condition to the visual condition that matches. Mismatch at 4/5 microvolts in both. Do these significantly differ from each other?
(3) It generally seems that if the authors wish to suggest that this paradigm can be used to study prediction error responses, they need to have controlled for the actions performed and the visual events. This logic is outlined in Press, Thomas, and Yon (2023), Neurosci Biobehav Rev, and Press, Kok, and Yon (2020) Trends Cogn Sci ('learning to perceive and perceiving to learn'). For example, always requiring Ps to walk and always concurrently playing similar visual events, but modifying the extent to which the visual events can be anticipated based on action. Otherwise, it seems more accurately described as a paradigm to study the influence of action on perception, which will be generated by a number of intertwined underlying mechanisms.
More minor points:
(1) I was also wondering whether the authors may consider the findings in frontal electrodes more closely. Within the statistical tests of the frontal electrodes against 0, as displayed in Figure 3c, the insignificance of the effect of Fp2 seems attributable to the small included sample size of just 13 participants for this electrode, as listed in Table S1, in combination with a single outlier skewing the result. The small sample size stands out especially in comparison to the sample size at occipital electrodes, which is double and therefore enjoys far more statistical power. It looks like the selected time window is not perfectly aligned for determining a frontal effect, and also the distribution in 3B looks like responses are absent in more central electrodes but present in occipital and frontal ones. I realise the focus of analysis is on visual processing, but there are likely to be researchers who find the frontal effect just as interesting.
(2) It is claimed throughout the manuscript that the 'strongest predictor (of sensory input) - by consistency of coupling - is self-generated movement'. This claim is going to be hard to validate, and I wonder whether it might be received better by the community to be framed as an especially strong predictor rather than necessarily the strongest. If I hear an ambulance siren, this is an especially strong predictor of subsequent visual events. If I see a traffic light turn red, then yellow, I can be pretty certain what will happen next. Etc.
(3) The checkerboard inversion response at 48 ms is incredibly rapid. Can the authors comment more on what may drive this exceptionally fast response? It was my understanding that responses in this time window can only be isolated with human EEG by presenting spatially polarized events (cf. C1, e.g., Alilovic, Timmermans, Reteig, van Gaal, Slagter, 2019, Cerebral Cortex).
Reviewer #2 (Public review):
Summary:
This study investigates whether visuomotor mismatch responses can be detected in humans. By adapting paradigms from rodent studies, the authors report EEG evidence of mismatch responses during visuomotor conditions and compare them to visual-only stimulation and mismatch responses in other modalities.
Strengths:
(1) The authors use a creative experimental design to elicit visuomotor mismatch responses in humans.
(2) The study provides an initial dataset and analytical framework that could support future research on human visuomotor prediction errors.
Weaknesses:
(1) Methodological issues (e.g., volume conduction, channel selection, lack of control for eye movements) make it difficult to confidently attribute the observed mismatch responses to activity in visual cortical regions.
(2) A very large portion of the data was excluded due to motion artefacts, raising concerns about statistical power and representativeness. The criteria for trial inclusion and the number of accepted trials per participant appear arbitrary and not justified with reference to EEG reliability standards.
(3) The comparison across sensory modalities (e.g., auditory vs. visual mismatch responses) is conceptually interesting, but due to the choice of analyzing auditory mismatch responses over occipital channels, it has limited interpretability.
The authors successfully demonstrate that visuomotor mismatch paradigms can, in principle, be applied in human EEG. However, due to the issues outlined above, the current findings are relatively preliminary. If validated with improved methodology, this approach could significantly advance our understanding of predictive processing in the human visual system and provide a translational bridge between rodent and human work.
Reviewer #3 (Public review):
Summary:
Solyga, Zelechowski, and Keller present a concise report of an innovative study demonstrating clear visuomotor mismatch responses in ambulating humans, using a mobile EEG setup and virtual reality. Human subjects walked around a virtual corridor while EEGs were recorded. Occasionally, motion and visual flow were uncoupled, and this evoked a mismatch response that was strongest in occipitally placed electrodes and had a considerable signal-to-noise ratio. It was robust across participants and could not be explained by the visual stimulus alone.
Strengths:
This is an important extension of their prior work in mice, and represents an elegant translation of those previous findings to humans, where future work can inform theories of, e.g., psychiatric diseases that are believed to involve disordered predictive processing. For the most part, the authors are appropriately circumspect in their interpretations and discussions of the implications. I found the discussion of the polarity differences, in light of separate positive and negative prediction errors, intriguing.
Weaknesses:
The primary weaknesses rest in how the results are sold and interpreted.
Most notably, the interpretation of the comparison between visuomotor mismatch responses and passive auditory oddball-induced mismatch responses is inappropriate, given suboptimal electrode choices, unclear matching of trial numbers, and other factors. To clarify, regarding the auditory oddball portion in Figure 5, the data quality is a concern for the auditory ERPs, and the choice of occipital electrodes is a likely culprit. Typically, auditory evoked responses are maximal at Cz or FCz, although these contacts don't seem to be available with this setup. In general, caution is warranted in comparing ERP peaks between two different sensory modalities, especially if attention is directed elsewhere (to a silent movie) during one recording and not during the other. The authors discuss this as a purely "qualitative" comparison in the text, which is appreciated, and do acknowledge the limitations within the results section, but the figure title and, importantly, the abstract set a different tone. At least, for comparisons between auditory mismatch and visuomotor mismatch, trial numbers need to be equated, as ERP magnitude can be augmented by noise (which reduces with increased numbers of trials in the average).
More generally, the size of the mismatch event at the scalp does not scale one-to-one with the size at the level of the neural tissue. One can imagine a number of variables that impact scalp-level magnitudes and are orthogonal to actual cortex-level activation: the size, spread, and polarity variance of the activated source (all of which would diminish amplitude at the scalp due to polyphasic summation/cancellation); the variance of phase to a stimulus across trials (cross-trial phase locking) versus the magnitude of underlying power (the former, in theory, relates to bottom-up activity, and the latter can reflect feedback, which has more variability in time across trials); and the distance of the scalp electrode from the activated tissue (which, for the auditory system, would be larger (FCz to superior temporal gyrus) than for the visual system (O1 to V1/2)). None of this precludes the inclusion of the auditory mismatch, which is a strength of the study, but interpretations about this supporting a supremacy of sensory-motor mismatch, regardless of validity, are not warranted. I would recommend changing the way this is presented in the abstract.
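To make the point about trial numbers concrete, the following is a minimal sketch on simulated noise only (not the authors' data or pipeline). It assumes epochs are stored as arrays of shape (trials, samples); the function name `equate_trial_counts` and all parameter values are hypothetical. It shows that the peak of an averaged noise-only "ERP" is larger when fewer trials are averaged, and how the larger condition could be randomly subsampled so both averages rest on the same number of trials.

```python
import numpy as np


def equate_trial_counts(epochs_a, epochs_b, seed=0):
    """Randomly subsample the larger condition so both have equal trial counts.

    epochs_a, epochs_b: arrays of shape (n_trials, n_samples).
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = min(len(epochs_a), len(epochs_b))
    keep_a = np.sort(rng.choice(len(epochs_a), size=n, replace=False))
    keep_b = np.sort(rng.choice(len(epochs_b), size=n, replace=False))
    return epochs_a[keep_a], epochs_b[keep_b]


# Simulated noise-only epochs: no true evoked response in either condition.
rng = np.random.default_rng(1)
few = rng.standard_normal((20, 200))    # condition with few accepted trials
many = rng.standard_normal((500, 200))  # condition with many accepted trials

# Peak absolute amplitude of the trial average (residual noise only).
peak_few = np.abs(few.mean(axis=0)).max()
peak_many = np.abs(many.mean(axis=0)).max()

# After equating trial counts, both averages carry comparable residual noise.
few_eq, many_eq = equate_trial_counts(few, many)
peak_many_eq = np.abs(many_eq.mean(axis=0)).max()

print(peak_few, peak_many, peak_many_eq)
```

In this simulation the average of 20 trials shows a larger spurious peak than the average of 500 trials, which is the sense in which unequal trial numbers can bias a comparison of peak amplitudes.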
Otherwise, the data are of adequate quality to derive most of their conclusions.
The authors claim that the mismatch responses emanate from within the occipital cortex, but I would require denser scalp coverage or a demonstration of consistent impedances across electrodes and across subjects to make conclusions about the underlying cortical sources (especially given the latencies of their peaks). In EEG, the distribution of voltage on the scalp is, of course, related to but not directly reflective of the distribution of the underlying sources. The authors are mostly careful in their discussion of this, but I would strongly recommend changing the word choice of "in occipital cortex" to "over occipital cortex" or even "posteriorly distributed". Even with very dense electrode coverage and co-registration to MRIs for the generation of forward models that constrain solutions, source localization of EEG signals is very challenging and not a simple problem. Given the convoluted and interior nature of human V1, the ability to reliably detect early evoked responses (which show the mismatch in mouse models) at the scalp in ERP peaks is challenging, especially if one is collapsing ERPs across subjects. And, given the latency of the mismatch responses, I'd imagine that many distributed cortical regions contribute to the responses seen at the scalp.
I think that a version of Figure 3C showing the difference between visual mismatch and halting flow alone (in the open loop) might be additionally informative, as it would clarify exactly where the pure "mismatch" or prediction error is represented.
As a suggestion, the authors are encouraged to analyse time-frequency power and phase locking for these mismatch responses, as is common in much of the literature (see Roach et al 2008, Schizophrenia Bulletin). This is not to say that doing so will yield insights into oscillations per se, but converting the data to the time-frequency domain provides another perspective that has some advantages. It fosters translations to rodent models, as ERP peaks do not map well between species, but e.g., delta-theta power does (see Lee et al 2018, Neuropsychopharmacology; Javitt et al 2018, Schizophrenia Research; Gallimore et al 2023, Cereb Ctx). Further, ERP peaks can be influenced by the actual neuroanatomy of an individual (especially for quantifying V1 responses). Time-frequency analyses may aid in interpreting the "early negative deflection with a peak latency of 48 ms" finding as well.
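As an illustration of the kind of time-frequency analysis suggested above, the following is a minimal sketch using a single complex Morlet wavelet on simulated data. It is not the authors' analysis and not a substitute for an established toolbox. The function name `morlet_power_itpc`, the 6 Hz test frequency, the 250 Hz sampling rate, and the noise levels are all assumptions chosen for illustration. It computes trial-averaged power and inter-trial phase coherence (ITPC) and shows that two simulated conditions with similar power can differ strongly in phase locking.

```python
import numpy as np


def morlet_power_itpc(epochs, sfreq, freq, n_cycles=5):
    """Single-frequency power and inter-trial phase coherence.

    epochs: array (n_trials, n_samples); each epoch must be longer than the
    wavelet (about 8 * n_cycles / (2 * pi * freq) seconds).
    Returns (power, itpc), each of shape (n_samples,).
    """
    sigma_t = n_cycles / (2 * np.pi * freq)  # temporal width of the wavelet
    t = np.arange(-4 * sigma_t, 4 * sigma_t, 1 / sfreq)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))
    wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))  # unit energy
    analytic = np.array(
        [np.convolve(trial, wavelet, mode="same") for trial in epochs]
    )
    power = np.mean(np.abs(analytic) ** 2, axis=0)
    # Normalize each trial to unit length so only phase contributes to ITPC.
    unit = analytic / np.maximum(np.abs(analytic), 1e-12)
    itpc = np.abs(np.mean(unit, axis=0))
    return power, itpc


# Simulated data: 100 trials of 2 s at 250 Hz containing a 6 Hz component.
rng = np.random.default_rng(0)
sfreq, freq, n_trials = 250.0, 6.0, 100
t = np.arange(0, 2.0, 1 / sfreq)

# Condition 1: the 6 Hz component has the same phase on every trial.
locked = np.sin(2 * np.pi * freq * t) + 0.5 * rng.standard_normal((n_trials, t.size))

# Condition 2: same amplitude, but a random phase on each trial.
phases = rng.uniform(0, 2 * np.pi, size=(n_trials, 1))
jittered = np.sin(2 * np.pi * freq * t + phases) + 0.5 * rng.standard_normal(
    (n_trials, t.size)
)

power_locked, itpc_locked = morlet_power_itpc(locked, sfreq, freq)
power_jit, itpc_jit = morlet_power_itpc(jittered, sfreq, freq)

mid = t.size // 2  # evaluate away from edge artifacts
print(power_locked[mid], power_jit[mid], itpc_locked[mid], itpc_jit[mid])
```

In this simulation power is similar across the two conditions while ITPC is high only for the phase-locked one, which is why reporting both measures can separate contributions that a peak in the averaged ERP conflates.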
Finally, the sentence in the abstract stating that a paradigm that "can trigger strong prediction error responses and consequently requires shorter recording times would simplify experiments in a clinical setting" is a nice setup to the paper, but the very fact that one third of recordings had to be removed due to movement artifact, and that hairstyle modulates the recording SNR, suggests that this paradigm, using the reported equipment, may have limited clinical utility in its current form. Further, auditory oddball paradigms are of great clinical utility because they do not require explicit attention and can be recorded very quickly with no behavioral involvement of a hospitalized patient. This should be discussed, although it does not detract from the overall scientific importance of the study. The authors should reconsider putting this statement in the abstract.
eLife assessment
This meta-analysis provides a fundamental synthesis of evidence demonstrating that transcranial magnetic stimulation targeting the hippocampal-cortical network reliably enhances episodic memory performance across diverse study designs. The evidence is convincing, with rigorous methodology and consistent effects observed despite modest sample sizes and some heterogeneity in stimulation approaches. The work highlights the specificity of memory improvements to hippocampal-dependent memories and identifies key methodological factors, such as individualized targeting, that influence efficacy. Overall, this study offers a timely and integrative framework that will inform both basic memory research and the design of future clinical trials for cognitive enhancement.