  4. Oct 2020
    1. Author Response

      Summary:

      The strengths of the study are the findings that a single oxytocin level measured from saliva or plasma is not meaningful in the way that the field might currently be measuring. The reviewers appreciated this finding, and the careful attention to detail, but felt that the results fell short.

      Reviewer #1:

      This article describes the investigation of a valuable research question, given the interest in using salivary oxytocin measures as a proxy of oxytocin system activity. A strength of the study is the use of two independent datasets and the comparison between intranasal and intravenous administration. The authors report poor reliability for measuring salivary oxytocin across visits, that intravenous delivery does not increase concentrations, and that salivary and blood plasma concentrations are not correlated.

      Line 77-78: While it's true that saliva collection provides logistical advantages, there are also measurement advantages (e.g., relatively clean matrix) that are summarised in the MacLean et al (2019) study, which has already been cited.

      Thanks for the suggestion. We added this advantage:

      Line 101 “Compared to blood sampling, saliva collection presents several logistical and measurement advantages (i.e. relatively clean matrix)(1).”

      Line 86: It is important to note that the 1IU intravenous dose in this study led to equivalent concentrations in blood compared to intranasal administration.

      The reviewer is right that 10 IU (over 10min) in our case increased the concentrations of plasmatic oxytocin beyond those observed for the spray or nebuliser (we reported the full time-course of variations in plasmatic oxytocin in another manuscript we published earlier this year)(2). This was an intentional aspect of our study design. We decided to use the highest intravenous dose (at the highest rate of 1IU/min) that we could get permission to administer safely in healthy volunteers as a proof of concept, so as to achieve a robust and prolonged increase in plasmatic oxytocin over the course of our full testing session. In this manner, we demonstrate that even when plasmatic levels of OT are maintained substantially increased throughout the observation interval, we cannot detect increases in salivary oxytocin. In this respect, we believe that our manuscript goes one step beyond the important findings described in Quintana et al. 2018(3), showing that this phenomenon is not linked to dosage (or to the amount of increase in plasmatic levels of exogenous OT), as far as we can determine given the current safety standards for the administration of OT IV.

      Please see also response to Reviewer 2, point 1.

      Line 158: When using both ELISA and HPLC-MS, extracted and unextracted samples are correlated when measuring oxytocin concentrations in saliva, at least in dogs. (https://doi.org/10.1016/j.jneumeth.2017.08.033).

      Thanks for pointing out this study. Indeed, in this specific study the authors found correlations between extracted and unextracted saliva samples. Such associations in humans have nevertheless been rare. In humans, the body of evidence suggests that the measurements obtained when comparing extracted samples to unextracted samples, or when comparing samples obtained using different methods of quantification (for instance, ELISA versus radioimmunoassay), do not correlate or show very low correlations (4, 5). Furthermore, most ELISA kits and HPLC-MS protocols to measure oxytocin have so far fallen short on sensitivity to detect the typical concentrations observed in humans at baseline (0-10pg/ml)(6). The current gold-standard method for quantifying oxytocin in biological fluids is the radioimmunoassay we used in this study(4). This method has shown superior sensitivity and specificity when compared to other quantification methods, when combined with extracted samples; therefore, it was our primary choice. We now highlight this advantage in the revised version of the manuscript more explicitly.

      Line 129 “For all analyses, we followed current gold-standard practices in the field and assayed oxytocin concentrations using radioimmunoassay in extracted samples, which has shown superior sensitivity and specificity when compared to other quantification methods(7).”

      Statistical reporting: I ran the article through statcheck R package (a web version is also available) and found a number of inconsistencies with the reported statistics and their p values. For example, on Line 302 the authors reported: t(123) = 1.54, p = 0.41, but this should yield a p value of 0.13. The authors should do the same and fix these errors.

      Thanks very much for taking the time to check our statistical reporting thoroughly. We apologize if we were not sufficiently clear in the previous version of the manuscript, but the p-values we reported are corrected for multiple comparisons using Tukey correction. Currently, statcheck can only evaluate inconsistencies when the results are reported in the standard APA style and does not take into consideration corrections for multiple comparisons of any kind. We did check all of our statistical reporting, and the p-values and corresponding statistics are correct (we only corrected an inadvertent error in reporting the degrees of freedom for these tests). In any case, we have now clarified in the manuscript when the reported p-values have been adjusted for multiple comparisons to avoid any further confusion.
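
      The disagreement discussed here can be reproduced with a small calculation. The sketch below is illustrative only: the function names and the numerical integration are assumptions made for this example and are not part of statcheck or of the authors' analysis. It recomputes the unadjusted two-sided p-value implied by t(123) = 1.54. It does not implement the Tukey adjustment; it only shows why an unadjusted recomputation lands near 0.13, whereas a p-value adjusted for multiple comparisons can legitimately be larger.

```python
import math


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_value: float, df: float, steps: int = 2000) -> float:
    """Unadjusted two-sided p-value, using Simpson's rule on [0, |t|]."""
    upper = abs(t_value)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # probability that |T| <= |t|
    return 1.0 - central_mass


# Statistic quoted by Reviewer 1 for Line 302 of the manuscript.
p_unadjusted = two_sided_p(1.54, 123)
print(f"unadjusted two-sided p = {p_unadjusted:.3f}")
```

      A Tukey-adjusted p-value is derived from the studentized range distribution across the whole family of post-hoc comparisons, so it cannot be recovered from the t statistic and its degrees of freedom alone.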

      Line 305: The confidence intervals for these correlations should be reported.

      We have now added the confidence intervals, estimated using bootstrapping, in our results section.
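
      The response does not state which bootstrap variant was used. As a minimal sketch, assuming a percentile bootstrap that resamples participants with replacement, a confidence interval for a saliva-plasma correlation could be obtained as follows. The concentrations are made-up values for illustration only, and the function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns None if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    r = sxy / math.sqrt(sxx * syy)
    return max(-1.0, min(1.0, r))  # guard against rounding just outside [-1, 1]


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            estimates.append(r)
    estimates.sort()
    alpha = 1 - level
    lower = estimates[math.floor(alpha / 2 * (n_boot - 1))]
    upper = estimates[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Hypothetical concentrations (pg/ml), NOT data from the study.
saliva = [1.2, 2.5, 0.9, 3.1, 1.8, 2.2, 4.0, 1.1, 2.9, 1.6, 3.4, 2.0]
plasma = [2.1, 1.4, 3.0, 2.2, 1.9, 3.5, 2.4, 1.2, 1.7, 2.8, 2.0, 3.1]

ci_low, ci_high = bootstrap_ci(saliva, plasma)
print(f"r = {pearson_r(saliva, plasma):.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

      Resampling whole participants keeps each saliva value paired with its plasma value, which is what the correlation is computed on.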

      Line 348: This is an important point, but it's important to note that the vast majority of these studies use plasma or saliva measures. Perhaps CSF measures are more reliable, but the question wasn't assessed in the present study, and I'm not sure if anyone has looked at this question.

      We are not aware of any study evaluating the stability of measurements of oxytocin in the CSF. Indeed, there are only a few studies sampling CSF to measure oxytocin in clinical patients and it is unlikely that CSF will become a widely used fluid to measure oxytocin in humans, given the invasiveness of the procedure to obtain CSF samples. Here, we wanted to refer specifically to saliva and plasma, which remain as the most popular options for measuring oxytocin in humans and which we investigated specifically in the current study. We have changed the text accordingly for clarity.

      Line 466 “Our data poses questions about the interpretation of previous evidence seeking to associate single measurements of baseline oxytocin in saliva and plasma with individual differences in a range of neuro-behavioural or clinical traits.”

      Line 423: I broadly agree with this conclusion, but it should be added that "single measurements of baseline levels of endogenous oxytocin in saliva and plasma are not stable under typical laboratory conditions" Perhaps these measures can be more stable using other means (i.e., better standardising collection conditions). But the fact remains, under typical conditions these measures do not demonstrate reliability.

      Thanks for the suggestion. We have revised the text accordingly throughout the manuscript (examples below). Our study is a pharmacological study, which means that it is conducted in a highly controlled setting and adheres to strict protocols (i.e. we tested participants at the same time of the day, we instructed participants to abstain from alcohol and heavy exercise for 24 h and from any beverage or food for 2 h before scanning). These exclusion criteria were stricter than those applied in a large number of studies sampling saliva and plasma to measure oxytocin for the purpose of estimating possible associations with various traits. Most of these studies do not control, for instance, for fluid or food ingestion. Therefore, we expected our reliability calculations to represent an optimistic estimate of the reliabilities of the salivary and plasmatic oxytocin concentrations used in most studies.

      For now, it remains unclear to us what factors might be driving the within-subject variability in salivary and plasmatic concentrations we report in this study. Thanks to Reviewer 3, we are now confident that this is unlikely to represent measurement error (see response to Reviewer 3, point 3).

      Line 117 “Here, we aimed to characterize the reliability of both salivary and plasmatic single measures of basal oxytocin in two independent datasets, to gain insight about their stability in typical laboratory conditions and their validity as trait markers for the physiology of the oxytocin system in humans.”

      Line 567 “In summary, single measurements of baseline levels of endogenous oxytocin in saliva and plasma as obtained in typical laboratory conditions are not stable and therefore their validity as trait markers of the physiology of the oxytocin system is questionable.”

      Reviewer #2:

      Summary:

      To test whether salivary and plasmatic oxytocin at baseline reflect the physiology of the oxytocin system, and whether salivary oxytocin indexes plasma levels, the authors quantified baseline plasmatic and/or salivary oxytocin using radioimmunoassay in two independent datasets. Dataset A comprised 17 healthy men sampled on four occasions at approximately weekly intervals. In dataset A, oxytocin was administered intravenously and intranasally in a triple-dummy, within-subject, placebo-controlled design, and the authors compared baseline levels and the effects of the routes of administration. Dataset A was also used to test whether salivary oxytocin can predict plasmatic oxytocin at baseline and after intranasal and intravenous administration of oxytocin. Dataset B comprised baseline plasma oxytocin levels collected from 20 healthy men sampled on two separate occasions. In both datasets, single measurements of plasmatic and salivary oxytocin showed insufficient reliability across visits (intra-class correlation coefficient: 0.23-0.80; mean CV: 31-63%). Salivary oxytocin increased after intranasal administration of oxytocin (40 IU), but did not change significantly after intravenous administration (10 IU). Saliva and plasma oxytocin did not correlate at baseline or after administration of exogenous oxytocin (p>0.18). The authors suggest that the use of single measurements of baseline oxytocin concentrations in saliva and plasma as valid biomarkers of the physiology of the oxytocin system is questionable in men. Furthermore, they suggest that saliva oxytocin is a weak surrogate for plasma oxytocin and that the increases in saliva oxytocin observed after intranasal oxytocin most likely reflect unabsorbed peptide and should not be used to predict treatment effects.

      General comments:

      The current study tested research questions relevant for the study field. The analyses in two independent datasets with different routes of oxytocin administrations is the strength of current study. However, the limited novelty of findings and several limitations are noticed in the current report as described below.

      Specific and major comments:

      1) Previous study with similar results has already revealed that saliva oxytocin is a weak surrogate for plasmatic oxytocin, and increases in salivary oxytocin after the intranasal administration of exogenous oxytocin most likely represent drip-down transport from the nasal to the oral cavity and not systemic absorption (Quintana 2018 in Ref 13). Therefore, the novelty of current findings is limited. The authors should more clearly state the novelty of current results and the replication of previous findings.

      We apologize for not describing the novelty and impact of our findings with sufficient clarity, and thanks for the opportunity to do so. Our study had two major goals. The first was to investigate whether single measurements of salivary and plasmatic concentrations of oxytocin can be reliably estimated within the same individual when collected at baseline conditions (i.e. without any experimental manipulation). As the reviewer highlighted, this is an important methodological question given the wide use of these measurements in a large and increasing number of studies to establish associations between the physiology of the oxytocin system and a number of brain and behavioural phenotypes in both clinical and non-clinical samples. However, to our knowledge, no previous study has appropriately conducted a thorough investigation of the reliability of these measurements (see also response to Reviewer 3, point 5). Thanks to our study, we now know that when single measurements are collected at baseline, salivary and plasmatic oxytocin cannot provide a sufficiently stable trait marker of the physiology of the oxytocin system in humans. As we highlight in the manuscript, this finding should deter the field from making strong claims based exclusively on associations of phenotypes with single measurements of peripheral oxytocin concentrations. Furthermore, our study also describes two very concrete implications of our findings which we believe are very important for the field. First, if baseline level of OT is to be used as a trait marker, future studies should, as much as possible, rely on repeated measures within the same participant but collected on different days to maximize reliability. Second, this less than perfect reliability should be taken into consideration when calculating the sizes of the samples needed to detect a certain effect, if it exists, with sufficient statistical power.
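
      The two implications above (averaging repeated measures, and allowing for imperfect reliability in power calculations) can be made concrete with textbook formulas. The sketch below is an illustration under stated assumptions, not the authors' calculation: it uses the Spearman-Brown formula for the reliability of a mean of k visits, Spearman's attenuation formula for the observable correlation, and the Fisher z approximation for the sample size needed to detect a correlation. The "true" correlation of 0.30 is hypothetical, the other variable is assumed to be measured without error, and 0.23 is simply the lowest ICC in the range quoted by Reviewer 2.

```python
import math
from statistics import NormalDist


def spearman_brown(icc_single: float, k: int) -> float:
    """Reliability of the mean of k repeated measurements."""
    return k * icc_single / (1 + (k - 1) * icc_single)


def attenuated_r(true_r: float, reliability_x: float, reliability_y: float = 1.0) -> float:
    """Correlation expected in the data once measurement unreliability is included."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n to detect correlation r (two-sided test, Fisher z method)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(abs(r))) ** 2 + 3)


TRUE_R = 0.30      # hypothetical trait-level association
ICC_SINGLE = 0.23  # lowest ICC in the range quoted in the reviewer summary

n_perfect = n_for_correlation(TRUE_R)
n_single = n_for_correlation(attenuated_r(TRUE_R, ICC_SINGLE))
n_mean_of_4 = n_for_correlation(attenuated_r(TRUE_R, spearman_brown(ICC_SINGLE, 4)))

print(f"perfectly reliable measure: n = {n_perfect}")
print(f"single oxytocin sample:     n = {n_single}")
print(f"mean of four visits:        n = {n_mean_of_4}")
```

      Under these assumptions the required sample grows sharply when a single low-reliability measurement is used, and averaging across visits recovers part of that loss.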

      The second goal of our study was, as pointed out by the reviewer, to revisit the findings of Quintana et al. 2018(3), but this time with two major design modifications which could strengthen the conclusions from that study. The first modification was the dose of intravenous oxytocin administered, which was considerably higher (see response to Reviewer 1, point 2). The administration of a higher dose that resulted in substantial and sustained increases in plasmatic oxytocin throughout the two hours observation period can only strengthen the previous conclusion that increases in plasmatic oxytocin cannot be detected in salivary measurements, and that this is not a matter of dose (as far as we can ascertain by administering the maximum intravenous dose we could safely administer in healthy volunteers). We believe that this is an important addition to the literature.

      The second modification concerned the choice of the method we used to quantify oxytocin. In this study, we used radioimmunoassay, which is superior to ELISA in sensitivity and hence more appropriate to measure the low concentrations of oxytocin in saliva and plasma typically detected in humans at baseline conditions (1-10 pg/ml; for most individuals 1-5 pg/ml)(6). For instance, in Quintana et al. 2018(3) the limitations in the sensitivity of the ELISA kit used led the authors to discard around 50% of the collected saliva samples. Hence, our study replicates and extends the previous findings from Quintana et al. 2018 in important ways, demonstrating that the lack of an association between increases in plasmatic oxytocin and salivary measurements is not limited by the dose of intravenous oxytocin administered or by limitations in the sensitivity of the method used to quantify oxytocin.

      We have now made the novelty and contribution of our work more explicit:

      *Line 77 “Currently, we lack robust evidence that single measures of endogenous oxytocin in saliva and plasma at rest are stable enough to provide a valid trait marker of the activity of the oxytocin system in healthy individuals. Indeed, previous studies have claimed within-individual stability of baseline plasmatic and salivary concentrations of oxytocin in both adults and children based on moderate-to-strong correlations between salivary and plasmatic oxytocin concentrations measured repeatedly within the same individual over time using ELISA in unextracted samples(14-16). However, these studies have a number of methodological limitations that raise questions about the validity of their main conclusion that baseline plasmatic and salivary concentrations are stable within individuals. First, measuring oxytocin in unextracted samples has been postulated as potentially erroneous, given the high risk of contamination with immunoreactive products other than oxytocin(4). It is conceivable that these non-oxytocin immunoreactive products might constitute highly stable plasma housekeeping proteins (17) that masked the true variability in oxytocin concentrations. Second, a simple correlation analysis cannot provide information about the absolute agreement of two sets of measurements – which would be a more appropriate approach to study within-subject reliability/stability. Third, it is not clear whether these findings generalize beyond the early parenting(14) or early romantic(15) periods participants were in when the studies were conducted, since these periods engage the activity of the oxytocin system in particular ways(18). Hence, establishing the validity of salivary and plasmatic oxytocin as trait markers of the activity of the oxytocin system in humans remains as an unmet need. 
      Such evidence is urgently required, given reports that plasma and saliva levels of oxytocin are frequently altered during neuropsychiatric illness and that they co-vary with clinical aspects of disease(13).”

      Line 509 “Our findings were not consistent with these expectations. We could replicate previous evidence that intravenous oxytocin does not increase salivary oxytocin(3) and extended it by showing that the lack of increase in salivary oxytocin is not limited to the specific low dose of intravenous OT that was previously used (1IU) and that it is not driven by the insufficient sensitivity of the OT measurement method (which had resulted in more than 50% of the saliva samples being discarded in the previous study(3)).”*
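
      The point made in the revised Line 77 text, that a correlation cannot establish absolute agreement, can be shown with a toy example. The sketch below assumes the two-way random-effects, absolute-agreement, single-measurement form of the intra-class correlation, often written ICC(2,1); the response does not state which ICC form was used, so this choice is an assumption. The data are artificial: the second visit equals the first plus a constant offset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_absolute_agreement(rows):
    """ICC(2,1) from a subjects-by-visits table, via two-way ANOVA mean squares."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between visits
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Artificial example: visit 2 is visit 1 shifted by a constant.
visit1 = [1.0, 2.0, 3.0, 4.0, 5.0]
visit2 = [v + 5.0 for v in visit1]

r = pearson_r(visit1, visit2)
icc = icc_absolute_agreement(list(zip(visit1, visit2)))
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")
```

      The rank order of participants is perfectly preserved, so the correlation is 1, yet the systematic shift between visits pulls the absolute-agreement ICC down to about 0.17.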

      2) As authors discussed in the limitation section of discussion, the current study has several limitations such as analyses only in male participants and non-optimized timing of collection of saliva and blood due to the other experiments. These limitations are understandable, because the current study was the second analyses on the data of the other studies with the different aims. However, these limitations significantly limit the interpretations of the findings.

      Here, we would like to highlight two aspects. First, most studies in the field are indeed conducted in men to avoid potential confounding from fluctuations in oxytocin concentrations across the menstrual cycle in women. Therefore, our study is representative of the typical samples used in most human studies. Second, we did not optimize our study to collect repeated samples of saliva. Indeed, it would have been interesting to describe the full time course of variations of oxytocin concentrations in saliva after intranasal and intravenous administration. However, this does not detract from the importance of our findings with respect to our first aim (which was our main goal).

      We agree with the reviewer though that it is at least theoretically possible that we could have missed the window for increases in salivary oxytocin after intravenous oxytocin if it existed, given that we only sampled one post-administration time-point. However, we believe this was unlikely for one reason. Despite the sustained increase (throughout the two-hour observation interval) in plasmatic oxytocin following the intravenous administration of oxytocin, we observed no increase in salivary oxytocin post-dosing (at ~115 min). Unless the half-life of oxytocin is shorter in saliva than in the blood (which we do not know yet), we expected the levels of salivary oxytocin to mirror the changes in plasma – potentially with a slight delay given the time that it might take for oxytocin concentrations to build up in saliva through ultrafiltration from the blood, but this was not the case. Most likely the half-life of oxytocin in the saliva is not shorter than in the blood, since a previous study found increased concentrations of oxytocin in saliva up to 7h after administration of intranasal oxytocin (as the reviewer pointed out below, in our study we no longer could detect significant increases in plasmatic oxytocin after the intranasal administration of 40 IU with two different methods at around 115 mins post-administration). Therefore, while we acknowledge these limitations we also believe they do not detract from the importance of our main findings and the potential they hold to influence the field towards a more rigorous use of these measurements. Please see below for the implemented changes in the text.

      Line 554 “It is possible that we may have missed peak increases in saliva oxytocin after the intravenous administration of exogenous oxytocin if they occurred between treatment administration and post-administration sampling. This is unlikely given that the dose we administered intravenously resulted in sustained increases in plasmatic oxytocin over the course of two hours. Unless the half-life of oxytocin in saliva is much shorter than in the plasma, it would be surprising to not find any increases in salivary oxytocin after intravenous oxytocin given that concentrations of oxytocin in the plasma were still elevated at the specific time-point of our second saliva sample. Currently, we have no estimate for the half-life of oxytocin in saliva; however, given that previous studies have found evidence of increased salivary oxytocin after single intranasal administrations of 16IU and 24IU oxytocin up to seven hours post-administration(19), it is unlikely that the half-life of oxytocin is shorter in the saliva than in the plasma.”

      3) As reported on page 6, dataset A comprises administrations of approximately 40 IU of intranasal oxytocin and 10 IU of intravenous oxytocin. The rationale for setting these doses should be described. Since 40 IU differs from the 24 IU employed in most previous publications in the research field, the potential influence of dose should be tested and discussed.

      Thank you for the opportunity to clarify this aspect of our work. With respect to our primary aim (to investigate whether single measurements of salivary and plasmatic oxytocin at baseline can be reliably measured within individuals across different days), the choice of doses is of course not relevant.

      With respect to our secondary aim, namely, to investigate whether salivary oxytocin can be used to index concentrations of oxytocin in the plasma, particularly after the administration of synthetic oxytocin using the intranasal and intravenous routes, the administered doses are relevant.

      The data reported here were collected as part of a larger project, which determined the choice of both intranasal and IV doses (2). As explained in our response to Reviewer 1, point 2, the selected dose of 10IU (over 10min) was the highest intravenous dose that we could get permission to administer safely in healthy volunteers as a proof of concept, so as to achieve a robust and prolonged increase in plasmatic oxytocin over the course of our full testing session. In this manner, we demonstrate that even when plasmatic levels of OT are maintained substantially increased throughout the observation interval, we cannot detect increases in salivary oxytocin.

      Regarding the intranasal OT dose, it is worth noting that the 24 IU is indeed popular in oxytocin studies, but not exclusive, and generally the selection of dose in oxytocin studies has not been informed by detailed dose-response characterizations. Our choice of 40IU was made for the purposes of matching our previous work on the pharmacodynamics of OT in healthy volunteers(20), and is a dose we (21-29) and others (e.g. (30)) have commonly used with patients.

      A potentially important implication is that, if dose variations also imply variations in the total volume of liquid administered (as is usually the case with standard nasal sprays, but not with the nebuliser), the potential for drip-down is likely to increase for higher volumes and decrease for lower volumes. As far as we know, no study has investigated the impact of administered volume on salivary oxytocin after the intranasal administration of synthetic oxytocin, but we agree this would be an important point to look at. We have now expanded our discussion to accommodate this point.

      Line 519 “We expect this phenomenon to be particularly pronounced for higher administered volumes. Further studies should examine the impact of different administered volumes on increases in salivary oxytocin.”

      4) It is difficult to understand that no significant elevations in plasma oxytocin levels were observed after intranasal spray or nebuliser of oxytocin. From figure 4A, the differences between levels at baseline and post administration are similar between nebuliser, spray, and placebo. Please discuss the potential interpretation on this result.

      The plasmatic concentrations of oxytocin we report in this study refer solely to the samples acquired at around 2h after the administration of intranasal oxytocin. We reported the full-time course of changes in plasmatic oxytocin in a paper published earlier this year(2) – which we now refer the reader to. We did find increases in plasmatic oxytocin after administration of oxytocin with the spray and nebuliser (around 3x the baseline concentrations) that did not differ between intranasal methods of administration. Plasmatic oxytocin reached a peak within 15 mins from the end of the intranasal administrations. Given the short half-life of oxytocin in the plasma, we believe it is not surprising that at 115 mins after the end of our last treatment administration the concentrations of oxytocin in the plasma are no longer different from the placebo condition.

      Line 166 “The full time course of changes in plasmatic oxytocin after the administration of intranasal and intravenous oxytocin in this study has been reported elsewhere(2).”

      5) In page 12, the reason why not to employ any correction for multiple comparisons in the statistical analyses should be clarified.

      We apologize that this was not sufficiently clear, but we did correct for multiple testing using the Tukey procedure in our analyses investigating the effects of treatment on salivary and plasmatic oxytocin (this was described in page 9 – Treatment effects). If the reviewer meant something else, we would be glad to follow any further advice on multiple testing correction he/she might have.

      Line 250 “Treatment effects: The effects of treatment on blood/saliva oxytocin concentrations were assessed using a 4 x 2 repeated-measures two-way analysis of variance: Treatment (four levels: Spray, Nebuliser, Intravenous and Placebo) x Time (two levels: Baseline and post-administration). Post-hoc comparisons to clarify a significant interaction were corrected for multiple comparisons following the Tukey procedure.”

      Reviewer #3:

      In the current study, baseline samples of salivary and plasma oxytocin were assessed in 13, respectively, 16 participants, to assess intra-individual reliability across four time points (separated by approximately 8 days). The main results indicate that, while as a group, average salivary and plasma samples were not significantly different across time points, within-subject coefficient of variation (CV) and intra-class correlation coefficient (ICC) showed poor absolute and relative reliability of plasma and salivary oxytocin measurements over time. Also no association was established between plasma and salivary levels, either at baseline or after administration of oxytocin (either intranasally, or intravenously). Further, salivary/ plasma oxytocin was only enhanced after intranasal, respectively intravenous administration.

      The study addresses an important topic and the paper is clearly written. While the overall multi-session design seems solid, sample collections were performed in the context of larger projects and therefore there appear to be several limitations that reduce the robustness of the presented results and consequently the formulated conclusions.

      General comments

      1) A main conclusion of the current work is that 'single measures of baseline oxytocin concentrations in saliva and plasma are not stable within the same individual'. It seems however that the study did not adhere to a sufficiently rigorous approach to put forward this conclusion. It lacks a control for several important factors, such as timing of the day at which saliva/ plasma samples were obtained, as well as sample volume. Particularly while it is indicated that all visits were identical in structure, important information is missing with regard to whether or not sampling took place consistently at a particular point of time each day, to minimize the influence of circadian rhythm. Without this information it is not possible to draw any firm conclusions on the nature of the intra-individual variability as demonstrated in the salivary and plasma sampling.

      Thanks for pointing this out. Indeed, we were not sufficiently explicit on how strict we were in controlling for some potential sources of variability that could have contributed to the lack of reliability we report here. Our data were acquired in the context of two human pharmacological studies, which by design were strict on a number of aspects to minimize unwarranted noise. All participants were tested in the same period of the day (morning) to avoid the potential contribution of circadian fluctuations of oxytocin. In dataset A, we tried, as much as possible, to match the exact time participants were tested between visits, using the start time of the first visit as a reference. With the exception of one participant, for whom one session was conducted 1h and 30 mins later than the other three, all the remaining participants from study A were tested within 1h of the exact start time of session 1. Further, we also instructed participants to abstain from alcohol and heavy exercise for 24 h and from any beverage or food for 2 h before scanning. Hence, we believe our sampling protocol was strict enough to discard any potential contribution of major known sources of variability in oxytocin levels.

      The reviewer also inquires about the volume of the samples. For the plasma samples, we used a standardized protocol and collected the same blood volume in all participants, visits and time-points (1 EDTA tube of approximately 4 ml). The saliva samples were collected using Salivettes. Participants were instructed to place the swab from the Salivette kit in their mouth and chew it gently for 1 min to soak as much saliva as possible. After this, the swab was returned to the Salivette and centrifuged. In both cases, to avoid degradation of the peptide in the collected sample, we followed a strict protocol where all samples were put immediately in iced water until centrifugation, which happened within 20 mins of sample collection. Samples were then immediately stored at -80C until analysis. Hence, differences in degradation of the peptide related to the processing of the sample are also unlikely to explain the poor reliabilities we report here.

      For completeness, we have now added all of these further details to our Methods section.

      Line 169 “All visits were conducted during the morning to avoid the potential confounding of circadian variations in oxytocin levels(31, 32). In addition, we also made sure that each participant was tested at approximately the same time across all four visits (all participants were tested in sessions with less than one hour difference in their onset time, except for one participant where the difference in the onset of one session compared to the other three sessions was 1.5h).”

      Line 192 “Blood was collected in ethylenediaminetetraacetic acid vacutainers (Kabe EDTA tubes 078001), placed in iced water and centrifuged at 1300 × g for 10 minutes at 4°C within 20 minutes of collection and then immediately pipetted into Eppendorf vials. Samples were immediately stored at -80°C until analysis. Saliva samples were collected using a salivette (Sarstedt 51.1534.500). Participants were instructed to place the swab from the Salivette kit in their mouth and chew it gently for 1 min to soak up as much saliva as possible. After this, the swab was returned to the Salivette, centrifuged and stored in the same manner as blood samples. For both saliva and plasma, we stored the samples in aliquots of 0.5 ml, following the RIAgnosis standard operating procedures. We followed this strict protocol, putting all samples in iced water until centrifugation with immediate storage at -80°C until analysis, to minimize the impact that putative differences in degradation of the peptide related to differences in the processing of the samples might have on the reliability of the estimated concentrations of oxytocin.”

      Correspondingly, a deeper discussion is needed on the reason why ICC's were considerably variable across pairs of assessment sessions, with some pairs yielding good reliability, whereas others yielded (very) poor reliability.

      Currently we have no insightful hypothesis on why this could have been the case. Indeed, we found higher ICCs for only 2 out of 6 pairs of visits for the plasma. However, it is plausible that this might have occurred by chance. In any case, we should note that the 95% confidence intervals for the ICCs of our different pairs of samples overlap; this suggests that there is no evidence that the ICCs we estimated for the specific two pairs where we found higher reliabilities are significantly higher than those observed in the remaining pairs.

      Line 431 “If there are specific reasons explaining the higher reliability indices observed for the specific pairs of sessions, these reasons remain to be elucidated. However, it is not implausible that we might have found higher reliabilities for these specific two pairs by chance, since the 95% confidence intervals for the ICCs for all pairs of samples overlapped.”

      More detailed descriptions regarding sampling procedures (timing and sampling intervals) are necessary. Also, more information is needed on the volume of saliva collected at each session, to control for possible dilution effects.

      This information has been added to the revised version of the manuscript (please see response to your point number 1). As a further clarification, oxytocin concentrations were measured in plasma and saliva aliquots of 0.5 ml, following the standard operating procedures of RIAgnosis. This volume was used for all participants, sessions and time-points. Furthermore, for measuring cortisol, the salivettes were shown to allow for an almost 100% recovery, regardless of cortisol concentration, volume of the sample or method of quantification(33), suggesting that the sampling method is robust.

      2) It is indicated that the initial sample would allow to detect intra-class correlation coefficients (ICC) of at least 0.70 (moderate reliability) with 80% of power. Is this still the case after the drop-outs/ outlier removals? Since the main conclusions of the work rely on negative results (conclusions drawn from failures to reject the null hypothesis) it is important to establish the risk for false negatives within a design that is possibly underpowered.

      We understand the concern of the reviewer. However, according to the power calculations provided by Bujang and Baharum, 2017(34), the four repeated samples we collected in Dataset A would have allowed us to detect an ICC of 0.5 with 80% of statistical power even with only 13 subjects (which is the lowest sample size we used for the analysis on saliva in dataset A). The two samples we collected in Dataset B would allow us to detect an ICC of 0.6 with 80% of statistical power even with only 19 subjects. Hence, both datasets were powered to detect an ICC of 0.7 with acceptable power, if it existed, even after the exclusion of outliers.

      3) Did the authors also assess within-session reliability? For example, by assessing ICC between pre and post-measurements in the placebo session.

      Thanks for the suggestion. Indeed, we had not performed this analysis before but we agree it would be informative. We calculated the ICC and CV for the two samples acquired during the placebo session, one before any treatment administration and one after the intravenous infusion of saline. These samples were acquired approximately 15 min apart. In this analysis, we found that the ICC was excellent (0.92) and the CV was 20%. This additional analysis strengthens our findings by supporting the idea that our poor reliabilities across different days reflect true biological variability and cannot be attributed to measurement error. These new findings have now been included in the revised version of the manuscript.

      Abstract

      Line 44 "Results: Single measurements of plasmatic and salivary oxytocin showed poor reliability across visits in both datasets. The reliability was excellent when samples were collected within 15 minutes from each other in the placebo visit.”

      Line 240 “Within-visit reliability analysis: To investigate the reliability of salivary and plasmatic oxytocin concentration within the same visit, we calculated the ICC and CV as described above for two samples acquired before any treatment administration and after the intravenous infusion of saline during the placebo session. These samples were acquired with an approximate 15 minutes interval in between them.”

      Line 405 “Furthermore, in a further analysis assessing the within-session stability of plasmatic oxytocin using two measurements collected 15 minutes apart from each other in the placebo visit (one sample collected at baseline and the other after the intravenous administration of saline), we found excellent within-session reliability (ICC=0.92, CV=20%). Together, this suggests that the low reliability of endogenous oxytocin measurements across visits in the current study results from true intrinsic individual biological variability and not technical variability/error in the method used for oxytocin quantification.”
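
      For readers unfamiliar with the two reliability indices used in this response, the sketch below shows one way an ICC and a within-subject CV can be computed from a subjects-by-measurements table. It is only an illustration under stated assumptions: the response does not say which ICC form was used, so a one-way random-effects ICC(1,1) is assumed here, and the function names and toy numbers are hypothetical.

```python
from statistics import mean, stdev


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) (an assumed form, not confirmed by the source).

    scores: list of rows, one row per subject, each with k repeated measurements.
    """
    n = len(scores)
    k = len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def mean_cv_percent(scores):
    """Average within-subject coefficient of variation, in percent."""
    return 100 * mean(stdev(row) / mean(row) for row in scores)


# Hypothetical toy data: three subjects, two measurements each.
toy = [[1, 2], [3, 4], [5, 6]]
print(icc_oneway(toy), mean_cv_percent(toy))
```

      A high ICC together with a non-trivial CV, as reported for the within-session samples, is possible because the ICC reflects how well subjects keep their rank relative to the between-subject spread, whereas the CV reflects the absolute spread within each subject.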

      4) It is indicated that the intra-assay variability of the adopted radioimmunoassay constitutes <10%. Were analyses of the current study run on duplicate samples? Was intra-assay variability assessed directly within the current sample?

      We reported the intra-assay variability determined by RIAgnosis during the development of this assay(35). This was not specifically assessed for the current study.

      Introduction & Discussion

      5) The introduction and discussion is missing a thorough overview of previous studies assessing intra-individual variability in oxytocin levels.

      Thanks for the suggestion. We have now included in our introduction/discussion an overview of previous studies attempting to tackle this question, which unfortunately do not address this question with sufficient detail or using the appropriate methods and statistical analyses (see response to Reviewer 2, point 1). Hence, from the available evidence, it is not possible to draw robust conclusions about the validity of concentrations of oxytocin in saliva and plasma as valid trait markers of the activity of the oxytocin system. With this manuscript, we hope we can prompt further discussion and guide the field towards a more rigorous use of these measurements. A thorough discussion of this literature has now been added to the Introduction and Discussion.

      Line 434 “Our observation of poor reliability questions the use of single measurements of baseline oxytocin concentrations in saliva and plasma as valid trait markers of the physiology of the oxytocin system in humans. Instead, we suggest that, at best, these measurements can provide reliable state markers within short time-intervals (15 mins in our study). Our data does not support previous claims of high stability of plasmatic and salivary oxytocin within individuals over time. For instance, in one study, Feldman et al. (2013) assessed plasmatic oxytocin in recent mothers and fathers at two time-points spaced six months apart during the postpartum period. The authors found strong correlations between the two assessments for both mothers and fathers(14). In another study, Schneiderman et al. (2012) found strong correlations between plasmatic oxytocin concentrations measured at two different instances spaced six months apart in both single individuals and individuals recently involved in a new romantic relationship(15). Two important differences between these studies and ours are i) the method used for oxytocin quantification, and ii) the particular states participants were in when the studies were conducted. Regarding the first difference, these previous studies used ELISA without extraction, reporting concentrations of plasmatic oxytocin well above the typical physiological range of 1-10 pg/ml detected in extracted samples (in their studies, the authors report concentrations above 200 pg/ml). The inclusion of extraction has been postulated as a critical step for obtaining valid measures of oxytocin in biological fluids(4). Unextracted samples were shown to contain immunoreactive products other than oxytocin(4), which contribute largely to the concentrations of oxytocin estimated by this method. It is possible that these non-oxytocin products might represent highly stable plasma housekeeping molecules(17) that masked, in these previous studies, the true biological variability in oxytocin concentrations between assessments that we could detect in extracted samples in our study. Regarding the second difference, these previous studies on within-individual stability were conducted during the early parenting(14) or early romantic(15) periods, which engage the activity of the oxytocin system in particular ways(18). Instead, we used a normative sample that did not specify these inclusion criteria. Hence, we cannot exclude that during these specific periods the reliability of salivary and plasmatic oxytocin concentrations might be higher. We note though that our sample more closely resembles the samples used in the vast majority of studies in the field (which sometimes even exclude participants during early parenthood(36)). Hence, our estimates of reliability are a better starting point for all studies where specific circumstances potentially affecting the activity of the oxytocin system have not been specified a priori.”

      6) The paper misses a discussion of previous studies addressing links between salivary/ plasma levels and central oxytocin (e.g. in cerebrospinal fluid). I understand the claim that salivary oxytocin cannot be used to form an estimate of systemic absorption, although technically, a lack of a link between salivary and plasma levels, does not necessarily imply a lack of a relationship to e.g. central levels. The lack of effect is limited to this specific relationship.

      In this study, we did not intend to investigate whether salivary and plasmatic oxytocin are valid proxies for the activity of the oxytocin system in the brain. Our data does not address that question and a thorough discussion of these studies falls, in our opinion, out of the scope of the manuscript. Instead, we focused on whether measurements of oxytocin in saliva and plasma (by far the most commonly used biological fluids to measure oxytocin) are sufficiently stable to provide valid indicators of the physiology of the oxytocin system in humans. Additionally, we also investigated whether salivary oxytocin can index plasmatic oxytocin at baseline and after the administration of synthetic oxytocin using different routes of administration.

      A previous meta-analysis of studies correlating peripheral and CSF measurements of oxytocin has shown that most likely peripheral and CSF measurements do not correlate at baseline; significant correlations could be found after intranasal administration of oxytocin or specific experimental manipulations, such as stress(37). We believe that currently we still do not have a clear answer about the extent to which these peripheral fluids can actually index oxytocin concentrations in the brain (even if associations with CSF are evident in specific instances). For instance, no study has ever shown that CSF oxytocin actually predicts the concentrations of oxytocin in the extracellular fluid of the brain. Given what we currently know about the synaptic release of oxytocin in the brain(38) (in contrast with former theories of exclusive bulk diffusion in the CSF(39)), we think we have good reasons to suspect this might not be the case.

      The only contribution our study can make in that respect is highlighting our current lack of understanding of how oxytocin reaches saliva if not from the blood. Currently there is no evidence of direct secretion of oxytocin to the saliva (not from acinar secretion or nerve terminal release). Hence, as it stands, the most likely mechanism for oxytocin to enter the saliva is from the blood (for instance, by ultrafiltration). If increases in plasmatic oxytocin after intravenous oxytocin cannot produce any significant increases in salivary oxytocin (shown in ours and in a previous study), how does oxytocin reach the saliva and why might it be able to predict concentrations in the CSF, if it does? In this respect, we hope our study highlights the need for further research shedding light on the mechanisms underlying these potential saliva – CSF relationships, if they exist. We would be glad to accommodate any other hypothesis the reviewer might have in this respect.

      Line 522 “The lack of increase in salivary oxytocin after the intravenous administration of exogenous oxytocin that was consistently found in our study and in a previous study(3) also raises the question of how oxytocin reaches the saliva if not from the blood. Currently there is no evidence of direct acinar secretion or direct nerve terminal release of oxytocin to the saliva; therefore, transport from the blood remains the most plausible mechanism of appearance of oxytocin in the saliva. Clarifying these mechanisms of transport is paramount, given the current hypothesis that salivary oxytocin might be superior to plasma in indexing central levels of oxytocin in the CSF(40).”

      Methods

      7) Related to the general comment, the variability in days between sessions is relatively high (average 8.80 days apart (SD 5.72; range 3-28). However, it appears that no explicit measures were taken to control the conducted analyses for this variability.

      Thanks for pointing this out. Indeed, we were not sufficiently thorough in exploring the impact of this potential variability in the time gap between visits on our estimated ICCs. Thanks to the reviewer, we now acknowledge this limitation of our analysis and decided to explore it further by running the following sensitivity analysis. First, we went back to our dataset A and identified all pairs of consecutive measures that were collected with an exact time interval of 7 days between visits. We could retrieve 15 examples of these pairs from 15 different participants for both saliva and plasma. Then, we recalculated the ICC and CV on this subset of our initial sample. In line with our main analysis, we found poor reliabilities for both salivary and plasmatic oxytocin; in both cases the ICCs were not significantly different from 0 and the CVs were 49% and 40%, respectively. This further analysis has been added to the revised version of the manuscript. We hope the reviewer shares our view that our main conclusion of poor reliabilities of single measurements of baseline oxytocin in saliva and plasma cannot be simply attributed to the variability in the number of days between visits.

      Line 229 “Since there was considerable variability in the time-interval between visits across participants, we conducted a sensitivity analysis where we repeated our reliability analysis focusing on 15 pairs of consecutive measures that were collected with an exact time interval of 7 days between visits in 15 participants. Here, we recalculated the ICC and CV on this subset of our initial sample, using the approach described above.”

      Line 399 “These poor reliabilities are unlikely to be explained by variability in the time-interval between visits of the same individual, since we also found poor reliability indexes for both saliva and plasma when we restricted our analysis to a subset of our sample controlling for the exact number of days spacing visits.”
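
      The subset selection behind this sensitivity analysis can be illustrated with a short sketch. It assumes, hypothetically, that each participant's visits are stored as (day, concentration) tuples and that the first qualifying pair of consecutive visits is kept per participant; the response does not state how a pair was chosen when more than one qualified, and all names and numbers below are illustrative only.

```python
def first_pair_at_interval(visits, days=7):
    """Return the first pair of consecutive visits exactly `days` apart, or None.

    visits: list of (day, value) tuples for one participant, in any order.
    """
    ordered = sorted(visits)
    for (d1, v1), (d2, v2) in zip(ordered, ordered[1:]):
        if d2 - d1 == days:
            return (v1, v2)
    return None


def select_pairs(data, days=7):
    """Keep one qualifying pair per participant; drop participants with none."""
    pairs = {}
    for participant, visits in data.items():
        pair = first_pair_at_interval(visits, days)
        if pair is not None:
            pairs[participant] = pair
    return pairs


# Hypothetical example: day of visit and a concentration in pg/ml.
example = {
    "p1": [(0, 2.1), (7, 3.0), (14, 2.5)],
    "p2": [(0, 1.0), (10, 1.5)],
    "p3": [(3, 4.0), (0, 5.0), (10, 4.4)],
}
print(select_pairs(example))
```

      The resulting pairs form a subjects-by-two-measurements table on which the ICC and CV can be recomputed in the same way as for the full sample.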

      8) A rationale for the adopted dosing and timing (115 min post administration) of the sample extraction is missing. Additionally, it seems that intravenous administrations were always given second, whereas intranasal administrations were given third, with a small delay of approximately 5 min. Hence, it seems that the timing of 115 min post-administration is only accurate for the intranasal administration.

      We collected saliva samples before any treatment administration and after the end of our scanning session (collection of saliva samples in between was not possible because the participants were inside the MRI machine and could not move their heads). For the plasma, we collected samples before any treatment administration, after each treatment administration and at five other time-points during the scanning session. Here, we only report the plasma data that was acquired concomitantly with the saliva samples (the full time-course of changes in plasmatic oxytocin has been reported elsewhere(2)). In the manuscript, we report post-administration times from the end of the full treatment administration protocol. Hence, as the reviewer highlights, our post-administration sample was collected at around 115 mins from the last intranasal administration and 120 mins from the end of the intravenous administration. We have now made this aspect explicit in the revised version of the manuscript.

      Line 162 “For the purposes of this report, we use the plasmatic and salivary oxytocin measurements that were obtained at baseline and at 115 minutes after the end of our last treatment administration (this means that our post-administration samples were collected 115 mins after the intranasal administrations and 120 mins after the intravenous administration of oxytocin).”

      9) Since the ICC of baseline samples showed poor reliability, it seems suboptimal to pool across sessions for assessing the relationship between salivary and blood measurements. It should be possible to perform e.g. partial correlations on the actual scores, thereby correcting for the repeated measure (subject ID). Further, since the sample size is relatively small (13 subjects), it might be recommended to use non-parametric (e.g. Spearman) correlations instead of Pearson. The additional reporting of the Bayes factor is appreciated; it is very informative.

      Thanks for the suggestion. In fact, for the correlation the reviewer mentions we indeed used a multilevel approach where we specified subject as a random effect (please see pages 9-10). This allowed us to deal with the dependence of measurements coming from the same subject in different visits. Furthermore, since we also had concerns about the sample size, we calculated Pearson correlations but used bootstrapping (1000 samples) to obtain the 95% confidence intervals and assess significance. Bootstrapping is a robust statistical technique which allows significance testing independently of any assumptions about the distribution of the data and is robust to outliers. Please see page 12 of the manuscript, section “Association between salivary and plasmatic oxytocin levels”.
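
      The bootstrap step described above can be sketched as follows. This is a minimal illustration, not the code used for the manuscript: it resamples (saliva, plasma) pairs with replacement and takes percentile limits, and it deliberately ignores the multilevel structure (subject as a random effect) mentioned in the response. The function names, the seed and the toy data are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # The correlation is undefined if a resample is constant; draw again.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        stats.append(pearson(xs, ys))
    stats.sort()
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data for 13 subjects.
saliva = [1.2, 0.8, 1.9, 1.1, 2.4, 0.7, 1.5, 1.3, 2.0, 0.9, 1.6, 1.0, 1.8]
plasma = [2.1, 3.0, 1.7, 2.6, 2.2, 1.9, 3.1, 2.4, 1.8, 2.9, 2.0, 2.7, 2.3]
print(pearson(saliva, plasma), bootstrap_ci(saliva, plasma))
```

      Under this scheme a correlation is treated as significant when the 95% interval excludes zero.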

      10) Now, the authors only compared relationships between salivary and plasma levels, either at baseline or post administration. I'm wondering whether it would be interesting to explore relationships between pre-to-post change scores in salivary versus plasma measures.

      Thanks for the suggestion. We have now conducted this further analysis and we could not find any significant correlation between changes from baseline to post-administration in any of our treatment conditions. As for our other correlation analyses, here we also conducted Bayesian inference, which supported the idea that the null hypothesis of no significant correlation between changes in saliva and plasma from baseline to post-administration is at least 4x more likely than the alternative hypothesis. This further analysis strengthens our confidence that changes in salivary oxytocin after administration of oxytocin using the intranasal and intravenous routes should not be used to predict systemic absorption to the plasma.

      Line 260 “As a final sanity check, we also investigated correlations between the changes from baseline to post-administration in saliva and plasma in each of our treatment conditions separately.”

      Line 485 “Furthermore, we could not find any significant correlation between changes in salivary or plasmatic oxytocin from baseline to 115 mins after the end of our last treatment administration in any of our four treatment conditions. The lack of significant associations between salivary and plasmatic oxytocin (and respective changes from baseline) was further supported through our Bayesian analyses which demonstrated that given our data the null hypotheses were at least three times more likely than the alternative hypothesis.”

      11) Please provide more information on the outlier detection procedure (outlier labelling rule).

      This information has now been added to the revised version of the manuscript.

      Line 271 “Outliers were identified using the outlier labelling rule(41); this means that a data point was identified as an outlier if it was more than 1.5 x interquartile range above the third quartile or below the first quartile.”
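
      The labelling rule quoted above can be written in a few lines. The sketch below is an illustration only: quartile definitions differ between statistical packages, so the "inclusive" method of Python's `statistics.quantiles` is an assumption and may not reproduce the exact fences obtained with the software used for the manuscript. The function name and the toy values are hypothetical.

```python
from statistics import quantiles


def label_outliers(values, k=1.5):
    """Return the values lying more than k x IQR beyond the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical concentrations with one extreme value.
print(label_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```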

      12) Please indicate how deviations from a Gaussian distribution were assessed.

      We used the combined assessment of i) differences between mean and median; ii) skewness and kurtosis; iii) histogram; iv) Q-Q plots; and v) the Kolmogorov-Smirnov and Shapiro-Wilk normality tests. Deviations from a normal distribution are common in the concentrations of several analytes in the saliva (42), including oxytocin (15); hence, following the current recommendations, we used log transformations of the raw concentrations but plotted the raw concentrations to facilitate the interpretation of our plots.
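
      As a small illustration of why a log transformation helps with right-skewed concentrations, the sketch below computes a moment-based skewness before and after taking natural logs. It covers only one of the checks listed above (skewness), uses made-up values rather than study data, and the function name is hypothetical.

```python
from math import log


def skewness(values):
    """Moment-based (population) skewness: third central moment over m2 ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed values; their logs are evenly spaced and hence symmetric.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [log(v) for v in raw]
print(skewness(raw), skewness(logged))
```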

      Results

      13) Please verify the degrees of freedom for the post-hoc tests performed to assess pre-post changes at each treatment level (e.g. baseline vs Post administration: Spray - t(122) = 7.06, p < 0.001) . Why is this 122? Shouldn't this be a simple paired-sample t-test with 13 subjects?

      We apologize for this oversight. Indeed, we made a mistake in copying the values of the degrees of freedom from SPSS. We have now corrected these values. All the other p-values and F or t values were reported correctly and hence are not changed in the revised version of the manuscript (please see also response to Reviewer 1, question 4 regarding inconsistencies in the reported p-values).

    1. Author Response

      We thank the Editor of eLife for kindly considering our manuscript for publication and for soliciting three peer reviews. We note that the reviews were positive for the most part. We sincerely believe that the key criticisms arise regrettably from a seeming misunderstanding of the motivation and context of our work – one that we hoped was a candid presentation of available data for tarantulas and the methods used. We provide detailed responses to the reviewers’ concerns below. We further note that our manuscript has since been published with minimal changes (Foley et al. 2020 Proceedings of the Royal Society B 287: 20201688, doi:10.1098/rspb.2020.1688).

      Tarantulas belong to an enigmatic and charismatic group with a nearly cosmopolitan distribution and intriguingly show vivid coloration despite being mostly nocturnal/crepuscular. Using a robust phylogeny based on a comprehensive transcriptomic dataset that includes nearly all theraphosid subfamilies (except Selenogyrinae), we performed both discrete and continuous ancestral state reconstructions of blue and green coloration in tarantulas using modern phylogenetic methods. Using phylogenetic correlation tests, we evaluated various possible functions for blue and green coloration, for instance aposematism and crypsis. Our results suggest green coloration is likely used in crypsis, while blue (and green) coloration shows no correlation with urtication, stridulation or arboreality. Our findings also support a single ancestral origin of blue in tarantulas with losses being more frequent than gains, while green color has evolved independently multiple times but has never been lost. We comparatively assessed opsin expression from the transcriptomic data across tarantulas to understand the functional significance of blue and green coloration. Our opsin homolog network shows that tarantulas possess a more diverse suite of regular arthropod opsins than previously appreciated.

      While color vision in (jumping) spiders is relatively well studied, to the best of our knowledge, this is the first study to comparatively consider the identity of opsin expression across tarantulas, and in relation to the evolution of coloration. Our study challenges current belief (e.g., Morehouse et al. 2017 doi: 10.1086/693977 and references therein; Hsiung et al. 2015 doi: 10.1126/sciadv.1500709) that tarantulas are incapable of perceiving colors, at least from a molecular perspective and suggests a role for sexual selection in their evolution. This also adds to the growing body of knowledge on the complexity of arthropod visual systems (e.g., see Futahashi et al. 2015 doi:10.1073/pnas.1424670112, Hill et al. 2002 doi:10.1126/science.1076196).

      In short, we believe our results are timely and broadly pertinent to sensory biologists, behavioural ecologists and evolutionary biologists, as they are an exhortation for sorely needed behavioural and sensory experiments to understand the proximate use of vivid coloration in this enigmatic group.

      Summary:

      This study offers some interesting data and ideas on colour evolution in tarantulas, building upon previous work on this topic. However, the reviewers judged that the insights are too taxon-specific and that several key conclusions are too speculative. There were also concerns about the methodology for trait scoring from photographs that the authors might consider going forward.

      Reviewer #1:

      This study investigates the evolution of blue and green setae colouration in tarantulas using phylogenetic analyses and trait values calculated from photographs. It argues that (i) green colouration has evolved in association with arboreality, and thus crypsis, and (ii) blue colouration is an ancestral trait lost and gained several times in tarantula evolution, possibly under sexual selection. It also uses transcriptome data to identify opsin homologs, as indirect evidence that tarantulas may have colour vision.

      Otherwise, a few comments:

      1) Given that data is limited for the family (only 25% of genera could be included in this study), it seemed a shame not to discuss further the variation in colour and habit within genera. Based on Figure 1 and supplementary tables, the majority of "blue" genera contain a mix of blue and not-blue (and not-photographed) species. Does this mean that blue has been lost many more times in recent evolutionary history? And how often are "losses" on your tree likely to be the result of insufficient sampling for the genus (i.e. you happen not to have sampled the blue species)?

      First, the taxa in our robust and well-resolved phylogeny are representative of the major lineages within Theraphosidae, i.e., we have sampled nearly all theraphosid subfamilies (except Selenogyrinae). Our ideal is also to work with a more complete genus-level molecular phylogeny and corresponding color dataset for Theraphosidae. However, this group is generally not well represented in museum collections (let alone in digitized collections), while the pet trade is focussed on only a select number of taxa. While we appreciate the reviewer’s concern that adding more taxa and corresponding data could potentially change the results, we believe that with a strong backbone phylogeny recovering the major branches, the results should not change all that much (for instance, cf. Hackett et al. 2008 10.1126/science.1157704 vs. Prum et al. 2016 10.1038/nature19417, where the initial Hackett et al. backbone is robust to increased sampling). The way trait losses are concentrated towards the tips suggests that a genus-level phylogeny would perhaps show a few more recent trait losses, but this is unlikely to contradict an ancient origin of blue coloration at the base of this group, especially given the way the outgroups are polarized (i.e., outgroups also exhibit blue).

      2) A key conclusion of the study is that sexual selection should not be discarded as a possible explanation for spider colour. However, there is very little detail given in the discussion to build this case. Do these spiders have mating displays that might plausibly include visual signals? How common are sexually-selected colours in spiders generally? Where on the body is the blue coloration (in cases where it is not whole body)? I also missed whether the images used are of males or females or both, or how many species show sexual dimorphism in colouration (mentioned briefly in the Discussion, but not summarised for species or genera).

      We agree with the reviewer that we should have provided more information regarding sexual dichromatism in tarantulas, and on the images we used in the study (whether male/female). However, the location of blue coloration varies wildly with species – some species have blue chelicerae, blue abdomens, or blue carapaces while others are entirely blue. We also know very little about mating (and selection, if any) strategies in tarantulas, let alone the sensory ecology of this group. However, there is intriguing anecdotal information from one species (Aphonopelma) that they can be active as early as 4pm (Shillington 2002 Canadian J. Zoology, 80: 251-259, doi: 10.1139/z01-227), while some species show an intensification of color upon maturation, often a hallmark of sexual selection. Indeed, we believe that our work will incite broad interest in these intriguing questions.

      3) A quick scroll through the amazing images on Rick West's site suggests that oranges and red/pinks are not rare in tarantulas. Perhaps the data is just not available, but it would be good to mention somewhere the rationale behind the blue/green focus, rather than examining all colours.

      We agree. However, in the present study, we focused on blue and green colors because the data is readily available and we wanted to build upon the previous work by Hsiung et al. 2015. Given that violet/blue and likely also some green coloration are structural in origin (Saranathan et al. 2015 Nano Letters, doi: 10.1021/acs.nanolett.5b0020; Hsiung et al. 2015), these hues are unlikely to fade or vary between individuals, unlike diet-acquired pigmentary coloration. Hence, these colors perhaps better lend themselves to analyses using digital photographs.

      I suggest defining stridulating / urticating setae for non-specialist readers. I had to look these up to understand that they were involved in defence.

      We thank the reviewer for this suggestion.

      I notice the Rick West website says species IDs should not be made from photos alone. Is there a risk of misidentification for any photos?

      We understand the reviewer’s concern. However, Rick West is an experienced arachnologist and quite knowledgeable in tarantula systematics and taxonomy (see https://www.tarantupedia.com/researchers/rick-c-west), which is why we endeavoured to use his website as extensively as possible without resorting to photos from hobbyists. We further validated the IDs with field guides, when in doubt.

      The Results section would benefit from some more clear statements of key results. For example, phrases like "AIC values to assess the relationships between greenness and arboreality are reported in Table 3" could be replaced instead with a summary statement indicating what this table shows.

      We agree and thank the reviewer for this suggestion.

      In the Figure 1 caption I think there is a typo: 'the proportions of species with images that possess blue colouration (grey = no available images)" but should this say "grey = not blue"?

      We apologize for the confusion. This is not a typo – this is in relation to Trichopelma, for which no images of described species were available, and so we cannot conclude that none of the taxa are blue/green.

      142 - the lengthy discussion here of whether there is one or more mechanisms by which blue is produced in tarantulas, and the detailed criticism of Hsiung's SEMs, seems a bit out of place given that the current study does not investigate the proximate mechanism of blue colouration but merely its presence.

      We respectfully disagree. The core support for Hsiung et al.’s (2015) argument against sexual selection as a driver of color evolution in tarantulas comes from their structural diagnoses of the nanostructures responsible for the violet/blue structural coloration and their subsequent argument that a diversity of divergent nanostructures rather than convergence argues against sexual selection. While it is true that we did not investigate the proximate mechanism of blue coloration here, one of us (Saranathan et al. 2015) has already done so elsewhere. It appears that in insects and spiders, the bulk of the nanostructural diversity is across families and not within.

      Table S6 - It is not clear to me how the values for predicted N orthologs were calculated.

      This is mentioned in line 354 of our methods – “Per the ‘moderate’ criteria from the Alliance of Genome Resources (55), hits may be considered orthologous if three or more of the twelve tools in their suite converge upon that result”.
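
      The 'moderate' criterion quoted above is a simple vote count, which can be sketched as follows. This is an illustrative sketch only: the tool labels and the dictionary layout are hypothetical placeholders, not the actual output format of the Alliance of Genome Resources suite.

```python
def is_orthologous(tool_calls, min_tools=3):
    """Return True if at least `min_tools` prediction tools support the hit.

    `tool_calls` maps a tool label to True/False (did that tool call the
    hit an ortholog?). The layout is an assumption made for illustration.
    """
    supporting = sum(1 for supported in tool_calls.values() if supported)
    return supporting >= min_tools


# Hypothetical example: twelve placeholder tools, four of which agree.
example_calls = {f"tool_{i}": (i <= 4) for i in range(1, 13)}
print(is_orthologous(example_calls))
```

      Under this sketch, a hit supported by only two of the twelve tools would not be counted towards the predicted number of orthologs.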

      The Table S7 caption states: "A * indicates currently undescribed species with blue or green colour that can be confidently attributed to corresponding genus. However, as the described species exhibit no blue or green colour, we conservatively scored these as 0." Is this a conservative approach though? If they have been confidently assigned to genus, I don't understand why they would not be included.

      This refers to the cases where a hitherto undescribed species possesses the blue or green color. However, even though the species has not formally been described, its placement in the genus is not in question. We have not included such undescribed species in our tabulated number of species per genus, as it is difficult to express any such undescribed species as a fraction of the total number of species in that genus.

      Reviewer #2:

      This paper presents a broad-ranging overview of tarantula visual pigments in relationship with the color of the spiders. The paper is interesting, well-written and presented, and will inspire further study into the visual and spectral characteristics of the genus.

      We thank the reviewer for her/his/their kind words.

      First a minor remark, Terakita and many others distinguish between opsin, being the protein part of the visual pigment molecule and intact light-sensing, so-called opsin-based pigment, often generalized as a rhodopsin. The statement of line 65, 'convert light photons to electrochemical signals through a signalling cascade' is according to that view strictly not correct. Furthermore, the presence of opsins in transcriptomes may be telling, but it is not at all sure that they are expressed in the eyes, if at all. As the authors well know, in many animal species some of the opsins are expressed elsewhere. It may be informative to mention that.

      We thank the reviewer for this clarification. As for the regions of opsin expression, we very much agree – were it not for constraints of sample availability, we would also have preferred to sequence only the eyes and brain of various tarantulas that were all exposed to similar lighting conditions. However, we encouragingly see that our “leg only” transcriptomes have far fewer (often no) opsins as compared to the whole-body data.

      The blueness or greenness feature prominently in the paper, but the criteria used for determining to which class a spider belongs are not at all clear. The Color Survey and Supplementary Table S2 refer to Birdspiders.com, but that requires a donation; not very welcoming. The other sources used also do not readily give insight into, or an overview of, which material was sampled. I therefore think that the paper would considerably gain in palatability by adding a few exemplary photographs as well as measured spectra. Of course, I am inclined to trust the authors, but I would not immediately take color photographs from the web as the best material for assessing color data with 4-digit accuracy. Furthermore, the accessible photographs do not always show nice, uniform colors, so it might be sensible to mention which body part was used to score the animals. And finally, using the CIE metric might imply to many readers that the spiders are presumably trichromatic, like us. Any further evidence?

      We refer to the detailed description of our method for scoring blue or green coloration in tarantulas (l. 277-303). Briefly, we calculated ΔE (CIE 1976) difference values between the images of each taxon and a suitable reference (the average of green leaves, or Haplopelma lividum, the bluest taxon in our survey based on the b value of its images). We use the ΔE Lab values to perform quantitative ancestral state reconstruction, while we use ΔE b (for blue) and ΔE a (for green) to discretize the data for understanding trait gains and losses.
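
      For readers unfamiliar with the metric, the CIE 1976 difference is the Euclidean distance between two colors in L, a, b coordinates, and the single-channel differences are the absolute offsets along a or b. The sketch below illustrates these quantities only; the Lab values, the threshold, and the function names are hypothetical and are not taken from our survey or scripts.

```python
import math


def delta_e_lab(lab1, lab2):
    """CIE 1976 colour difference: Euclidean distance in (L, a, b)."""
    return math.dist(lab1, lab2)


def delta_a(lab1, lab2):
    """Difference along the green-red (a) axis only."""
    return abs(lab1[1] - lab2[1])


def delta_b(lab1, lab2):
    """Difference along the blue-yellow (b) axis only."""
    return abs(lab1[2] - lab2[2])


def scores_as_blue(lab, blue_reference, max_delta_b):
    """Discretize: a taxon scores as blue if its b value is close to the reference."""
    return delta_b(lab, blue_reference) <= max_delta_b


# Hypothetical values for illustration only.
BLUE_REFERENCE = (35.0, 10.0, -45.0)  # stand-in for the bluest reference taxon
MAX_DELTA_B = 20.0                    # assumed cut-off, not the published one
candidate = (40.0, 5.0, -30.0)

print(delta_e_lab(candidate, BLUE_REFERENCE))
print(scores_as_blue(candidate, BLUE_REFERENCE, MAX_DELTA_B))
```

      Using the b (or a) channel alone for discretization leaves out the L channel, which is the one most directly affected by differences in exposure between photographs.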

      BirdSpiders.com only requires one to enter names of genera as search terms in order to see the photos that we used. However, we agree that we could have provided some photos of exemplars. We do realise that using pictures is not ideal compared with reflectance spectrophotometry (our ideal as well), which is why we limited ourselves to a single reputable source (BirdSpiders.com) for consistent images, whenever possible. However, acquiring sample material and reflectance measurements of tarantulas is challenging. This group is generally not well represented in museum collections (let alone in digitized collections), while the pet trade is focussed on only a select number of taxa, and doing field work to collect specimens is fraught with moral and ethical issues (e.g., see https://www.nytimes.com/2019/04/01/science/poaching-wildlife-scientists.html). This study nevertheless represents a substantial improvement upon a recent high-profile work that used the OSX “color picker” function (Hsiung et al. 2015).

      Indeed, available evidence on tarantula vision (including our opsin sequences) suggests tarantulas are likely trichromats (Dahl and Granda 1989 J. Arachnol., Morehouse et al. 2017) similar to jumping spiders (e.g., Zurek et al. 2015, doi: 10.1016/j.cub.2015.03.033), so we consider CIE as an appropriate color space for a putative tristimulus system in tarantulas (see also our response to Reviewer 3). Again, this underscores the need for future studies on the sensory biology and psychophysics of this enigmatic group.

      Reviewer #3:

      This neat paper continues the story of structural colour evolution in a group that is rarely appreciated for their ornamentation. The study uses colour & ecological data to model their evolution in a comparative framework, and also synthesises transcriptomic data to estimate the presence and diversity of opsins in the group. The main findings are that the tarantulas are ancestrally 'blue' and that green colouration has arisen repeatedly and seems to follow transitions to arboreality, along with evidence of perhaps underappreciated opsin diversity in the group. It's well-written and engaging, and a useful addition to our understanding of this developing story. I just have a few concerns around methods and the interpretation of results, however, which I feel need some further consideration.

      We thank the reviewer for his/her/their kind words.

      As the authors discuss in detail, this work in many ways parallels that of Hsiung et al. (2015). The two studies seem to agree in the broad-brush conclusions, which is interesting (and promising, for our understanding of the question), though their results conflict in significant ways too. Differences in methodology are an obvious cause, and they are particularly important in studies such as this in which the starting conditions (e.g. the assumed phylogeny or decisions around mapping of traits) so significantly shape outcomes. The current study uses a more recent and robust phylogeny, which is great, and the authors also emphasise their use of quantitative methods to assign colour traits (blue/green), unlike Hsiung et al.

      We thank the reviewer for his/her/their appreciation.

      1) This latter point is my main area of methodological concern, and I am not currently convinced that it is as useful or objective as is suggested. One issue is that the photographs are unstandardised in several dimensions, which will render the extracted values quite unreliable. I know the authors have considered this (as discussed in their supplement), but ultimately I don't believe you can reliably compare colour estimates from such diverse sources. Issues include non-standardised lighting conditions, alternate white-balancing algorithms, artefacts introduced through image compression, differences in the spectral sensitivities of camera models, no compensation for non-linear scaling of sensor outputs (which would again differ with camera models and even lenses), and so on (the works of Martin Stevens, Jolyon Troscianko, Jair Garcia, Adrian Dyer offer good discussion of these and related challenges). Some effort is made to minimise adverse effects, such as excluding the L dimension when calculating some colour distances, but even then the consequences are overstated since the outputs of camera sensors scale non-linearly with intensity, and so non-standardised lighting will still affect chromatic channels (a & b values). So with these factors at play, it becomes very difficult to know whether identified colour differences are a consequence of genuine differences in colouration, or simply differences in white balancing or some other feature of the photographs themselves.

      We thank the reviewer for his/her/their carefully considered thoughts and for drawing our attention to the work of Martin Stevens, Jolyon Troscianko, Jair Garcia, and Adrian Dyer in this regard (e.g. Stevens et al. 2007 Biol. J. Linn. Soc. Lond., doi: 10.1111/j.1095-8312.2007.00725.x). These are fair points raised by the reviewer. We are indeed aware that there are clear drawbacks in working solely with photographs from online sources as opposed to optical reflectance data (our ideal), but we are sure that the reviewer appreciates how challenging it is to source specimens of tarantulas. It is for this reason that we restricted ourselves to photographs from mostly only 1 reputable source (BirdSpiders.com). Furthermore, this is why we chose a perceptual model that permits device independent color representation, one that lets us separate chromatic variables from brightness, keeping in mind the underlying assumptions. However, some recent research suggests that CIELab space can perform reasonably well as compared to the latest algorithms for illuminant-invariant color spaces (Chong et al. 2008 ACM Transactions on Graphics, doi: 10.1145/1360612.1360660). Please also see our response below (to point #2) and also to Reviewer #2 above.

      Given the dearth of tarantula specimens and in the absence of spectrometry, future work will have to try and acquire uncompressed original images (with EXIF data) and could perform image processing such as homomorphic filtering and adaptive histogram equalization (Pizer et al. 1987 Computer Vision, Graphics, and Image Processing; Gonzalez and Woods 2018 Digital Image Processing, Pearson) in order to further mitigate artefacts such as those arising from differences in illumination, especially if using images from a diversity of sources.

      2) The justification for some related decisions are also unclear to me. The CIE-76 colour distance is used, and is described as 'conservative'. But it is not so much conservative as it is an inaccurate model of human colour sensation. It fails to account for perceptual non-uniformity and actually overestimates colour differences between highly chromatic colours (like saturated blues). The authors note they preferred this to CIE-2000, which is a much better measure in terms of accuracy, because the latter was too permissive (line 300). I understand the problem, and appreciate their honesty, but this decision seems very arbitrary. If the goal is to quantitatively estimate colour differences according to human viewers, then the metric which best estimates our perceptual abilities would strike me as most appropriate. Also, the fact that all species would be classified as 'blue' using the CIE-2000, when some of them are obviously not blue by simply looking at them, is consistent with the kinds of image-processing issues noted above. I only focus on this general point because it is offered as a key advance on previous work (L 40-41), but I don't think that is clearly the case (though I agree that the scoring methods of Hsiung et al. are quite vague). I'm generally in favour of this sort of quantitative approach, but here I wonder if it wouldn't be simpler and more defensible to just ask some humans to classify images of spiders as either 'blue' or 'green', since that seems to be the end-goal anyway.

      We agree that CIE 1976 is an inaccurate model of “human color sensation,” but at the same time the degree of its applicability, or lack thereof, to non-human tristimulus visual systems is not clear. In any case, the digital photographs do not preserve UV information. We hasten to add that CIE 1976 is still widely used in color science and engineering research for its simplicity and perceptual uniformity, as a simple Google Scholar search would attest. We believe that the reviewer is perhaps mistaken as to our motivation for choosing CIE 1976 and the exact nature of the shortcomings of the CIE 1976 model, which turn out to be an unintended advantage. Our goal was not, as the reviewer suggests, to just “quantitatively estimate color differences according to human viewers,” but to do so in a device independent fashion given the constraints of working with already available digital images, and for a putative trichromat visual system. Given there are technically no limits for a and b values in the CIE 76 space, color patches with high values of chroma are computed to have a stronger difference than they actually do (Hill et al. 1997 ACM Transactions on Graphics, 16, 109-154). This is precisely the kind of situation that we do not face here, as we are essentially comparing shades of blue rather than, for instance, chromatic contrasts between saturated blue vs. green or blue vs. red. Moreover, we only use the rectilinear rather than the polar coordinate representation of the colors (in other words, we do not compute the psychometric correlates, chroma Cab, or the hue angle hab). Contrary to the reviewer’s assertion that the CIE 1976 “overestimates color differences between highly chromatic colors (like saturated blues),” a quick perusal of Table S3 affirms that a comparison of highly saturated blues such as between our “standard” H. lividum and Poecilotheria metallica reveals they are quite close in terms of chromatic contrasts (i.e., small ΔE values). Moreover, CIE 1994 and subsequent revisions rely on a von Kries-type transformation to account for non-uniformity of the perceptual space, but as the reviewer is well aware, without an accurate idea of the illumination conditions, use of CIE 2000 is not justified.

      Lastly, we are sure the reviewer appreciates that asking humans to manually score the colors of images (e.g. Hsiung et al. 2015) is neither reproducible nor enables quantitative analyses of trait evolution.

      3) L26-27, 53-56, 171-176: This is a more minor point than the above, but some of the discussion and logic around hypothesised functions could be elaborated upon, given it's presented as a motivating aim of the text (52-56). The challenge with a group like this, as the authors clearly know, is that essentially none of the ecological and behavioural work necessary to identify function(s) has been done yet, so there are serious limitations on what might be inferred from purely comparative analyses at this stage. The (very interesting!) link between green colouration and arboreality is hypothesised and interpreted as evidence for crypsis, for example, but the link is not so straightforward. Light in a dense forest understory is quite often greenish (e.g. see Endler's work on terrestrial light environments) including at night which, when striking a specular, structurally-coloured green could make for a highly conspicuous colour pattern - especially achromatically (which is what nocturnal visual predators would often be relying on). This is particularly true if the substrate is brown rotten leaves or dirt, in which case they could shine like a beacon. Conversely, if the blue is sufficiently saturated and spectrally offset from the substrate it could be quite achromatically cryptic at dusk or night. To really answer these questions demands information on the viewers, viewing conditions, visual environment etc. The point being that it is a bit too simplistic to observe that, to a human, spiders are green and leaves on the forest floor may be green, and so suggest crypsis as the likely function (abstract L 22-23). So inferences around visual function(s) could either be toned down in places given the evidence at hand or shored up with further detail (though I'm not sure how much is available).

      We agree. Indeed, we are limited by the absence of rigorous behavioural studies. With this in mind, we have already made every effort to tone down our inferences and to emphasize that our results might point towards a given function, without claiming it outright. It is our fervent hope that these findings will form the basis for future behavioural studies by giving researchers a starting point to test their hypotheses.

      We would like to point out that the association we uncovered is actually between arboreal taxa and the presence of green coloration, and not, as the reviewer says, “spiders are green and leaves on forest floor may be green.” These taxa live in natural crevices on trees and shrubs, and essentially spend their lives arboreally. Also, green coloration in tarantulas need not be structural in origin (see e.g., Saranathan et al. 2015), and this is why, to test for crypsis against foliage, we used (pigmentary) leaves as the representative model for comparison to tarantula green colors. That said, certain lycaenid butterflies (Saranathan et al. 2010, doi: 10.1073/pnas.0909616107; Michielsen et al. 2010, doi: 10.1098/rsif.2009.0352), for instance, use structural coloration to aid crypsis against foliage.

      Minor comments:

      • I'm not familiar enough with methods for creating homolog networks to comment in detail, but the use of BLASTing existing opsin sequences against transcriptomes seems straightforward enough. As do the methods for phylogenetic reconstruction.

      We agree this is straightforward.

      • L48: What constitutes a 'representative' species? And how reasonable is it to assign a value for such a labile trait to an entire genus? I understand we can only do our best of course and simplifications need to be made, but I can imagine many cases among insects (e.g. among butterflies and flies) where genus-level assignments would be meaningless due to the immense diversity of structural colouration among species (including in terms of simple presence/absence).

      Please see our response to Reviewer 2 above.

      • Line 168: Wouldn't this speak against a sexual function? Only in a tentative way of course, but the presence of conspicuous structural colouration in juveniles, which is absent in adults, would suggest a non-sexual origin to me.

      The reviewer’s inference is incorrect. We do not suggest that blue coloration is present in juveniles but absent in adults, but only that such conspicuous colors already appear in the penultimate moult right before the male creates a sperm web and is ready for mating.

    1. Author Response

      1) There were concerns about the normality tests and reanalysis to avoid pseudo-replication that must be addressed.

      We have now checked the data with two tests for normal distribution (Shapiro-Wilk and Kolmogorov-Smirnov) and found that the flight data do not follow a normal distribution. Therefore, statistical analyses of flight data have now been performed using non-parametric tests. We have used the Kruskal-Wallis test followed by Dunn’s multiple comparison test for multiple comparisons and the Mann-Whitney U test for pairwise comparisons. This information has been included in the statistical tests section in the Methods. Regarding pseudo-replication, as suggested, imaging data have been replotted and recalculated to include just one cell, or one lobe, per brain. In addition, we have included individual brain traces for every experiment as supplemental data (Figure 5 - supplement F2, Figure 6 – supplement F1, F3 and F4).
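
      The logic of avoiding pseudo-replication and then comparing two groups non-parametrically can be sketched as below. This is a minimal illustration under stated assumptions: it collapses cells to one value per brain by averaging (our reanalysis instead includes just one cell or lobe per brain), all data values and brain labels are hypothetical, and only the Mann-Whitney U statistic is computed, with p-values left to a statistics package.

```python
from collections import defaultdict
from statistics import mean


def per_brain_values(measurements):
    """Collapse (brain_id, value) pairs to one value per brain.

    Averaging is an assumption for illustration; picking a single
    cell per brain achieves the same independence between data points.
    """
    by_brain = defaultdict(list)
    for brain_id, value in measurements:
        by_brain[brain_id].append(value)
    return [mean(values) for values in by_brain.values()]


def mann_whitney_u(x, y):
    """U statistic for sample x: pairs with x > y count 1, ties count 0.5."""
    u = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u += 1.0
            elif xi == yi:
                u += 0.5
    return u


# Hypothetical responses: several cells per brain in each genotype.
control = [("c1", 1.0), ("c1", 3.0), ("c2", 5.0), ("c3", 4.0)]
mutant = [("m1", 0.5), ("m1", 1.5), ("m2", 1.0), ("m3", 2.0)]

control_brains = per_brain_values(control)  # one value per brain
mutant_brains = per_brain_values(mutant)
print(mann_whitney_u(control_brains, mutant_brains))
```

      The two U statistics, one per group, always sum to the product of the two sample sizes, which is a convenient sanity check on any implementation.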

      2) Discussion should be made clearer and expanded to encompass more of the literature. Specifically, the authors should expand upon the final section of the discussion to discuss more about 1) the potential context for cholinergic modulation of the PPL1-y2alpha'1 DANs (For example, consider where the acetylcholine signal onto DANs might come from. DANs may not be entirely presynaptic to Kenyon cells but might also receive input from Kenyon cells.), 2) the proposed role of these DANs (which have been studied in several contexts) and 3) modulation of innate behavior in general. The paper begins with the importance of modulating innate behavior, but the discussion on this topic is spare and focused almost entirely on research on the mushroom bodies of Drosophila. The discussion section leans heavily on summarizing the results, rather than making connections to work in other systems or networks.

      As suggested, we have now addressed each of these points in greater detail in the last section of the discussion, which has been expanded to two paragraphs. The possibility of cholinergic inputs from KCs to DANs stimulating the IP3R has been included in the discussion and in the final model in Figure 7. Several other references that mention the role of PPL1-y2alpha'1 DANs in modulation of behaviour are now included – see the last paragraph of the discussion. We have expanded the last section of the discussion to include possible roles for other regions of the brain in modulating flight and references to other insect brains, where relevant.

      3) One common point raised by all reviewers was the need for expression of the itprDN during pupation which could have been due to either the perdurance of endogenous itpr vs. a developmental effect caused by the itprDN (the authors fully acknowledge the issue). This section raised many questions that aren't within the scope of this study, nor are easily resolved. Nevertheless, the authors must expand upon the implications of these results and suggest future studies that will be needed to resolve the issue.

      We are indeed unable to state unequivocally whether adult behavioural phenotypes arising from expression of the IP3R^DN are only pupal or both pupal and adult. We have expanded on the implications of these results both in the results (Page 9-10) and in the discussion (page 11). One way of addressing this is to express a tagged IP3R^DN specifically in late pupae and then follow its perdurance in adults. This experiment has now been suggested as a way to resolve this issue in the second paragraph of the discussion.

      Reviewer #1:

      The authors report experiments on Drosophila to show that the proper function of an IP3 receptor in a small subset of dopaminergic neurons is required for flight behavior. Most interesting is the fact that the requirement is restricted to a time point during pupal development. Technically, the authors report a novel dominant-negative mutant form of the IP3 receptor to interfere with its function. Physiologically, the IP3 receptor-dependent impairment in the function of the dopaminergic neurons affects both synaptic vesicle release and excitability. Also, muscarinic acetylcholine receptors are required for proper development of the flight-modulating circuit during development.

      The role of dopamine in the brain of Drosophila (as a model for general dopamine and brain function) is in the center of current research, and is studied by a large number of laboratories. More and more types of behavior are discovered that are modulated by dopaminergic neurons, and in particular those innervating the mushroom body. Therefore, the study is of very high interest for researchers working on Drosophila, but also to a broader readership.

      The experiments are well designed, with appropriate controls in place. The conclusions drawn are highly interesting and novel (dopaminergic modulation of flight behavior, perhaps in the context of food seeking behavior, molecular mechanisms of circuit maturation).

      Minor comments:

      1) A test for normal distribution of data is required to determine whether parametric statistical tests are actually appropriate.

      Done – please see response above.

      2) It is not clear to me why the authors conclude an acute requirement of IP3R during the adult state although the phenotype can arise through a genetic intervention during earlier time points in development (Page 9, lines 297ff). This has to be outlined much more clearly. My interpretation of the data is: During a certain time window after pupal formation IP3 signaling is required for a proper formation of the neuronal circuit. This is likely to be not only a cell-intrinsic (i.e., cell autonomous) effect because the mAChR is also required during this time window. This provides an excellent example (there are actually only very few!) of circuit development that requires synaptic interactions between neurons. If one keeps in mind that dopaminergic neurons have reciprocal synapses with Kenyon cells (e.g. Cervantes-Sandoval et al., eLife 2017; should be included in schematic illustration!), and these release acetylcholine onto dopaminergic neurons, a potential circuit maturation based on the concerted activity is most interesting. I suggest that the authors point out more precisely how they think the actual phenotype comes about, of course, with all due caution.

      The primary reason that we suggest an adult requirement for the IP3R in the DANs is that we see a Ca2+ response to carbachol in adult PPL1-y2alpha'1 DANs (Figure 5 – supplement 1). We put this finding together with the observations that carbachol stimulates dopamine release from PPL1-y2alpha'1 DANs (Figure 5) and that blocking vesicle release acutely in adults reduces the duration of flight bouts (Figure 4) to suggest that there is likely to be an adult requirement. However, we agree that this is not conclusive and certainly does not negate a pupal requirement. As mentioned above, we have addressed the pupal vs pupal+adult issue in greater detail in the results (page 9, 10) and discussion (page 11). We agree that there may be acetylcholine release from Kenyon cells at the MB synapse. This possibility has been included in the discussion and in Figure 7.

      3) Statistical tests should be done across independent brains, not across different cells in the same brains.

      We have done this. Thank you for pointing this out.

      Additional data files and statistical comments:

      A test for normal distribution of data is required to determine whether parametric statistical tests are actually appropriate.

      Done.

      Figure legend 5 C should be 5B. The scaling of the y-axis is not optimal.

      Done.

      Statistical tests should be done across independent brains, not across different cells in the same brains. This would cause a mixture of dependent and independent data. This is of importance!

      Done.

      Reviewer #2:

      The results of the individual experiments reported by the authors are convincing. The approach is rigorous and they take full advantage of the many powerful molecular genetic tools available in Drosophila. The identification of a mechanism by which a small subset of dopaminergic cells may control behavior is significant. My concerns about the manuscript are relatively minor.

      Minor comments:

      I have reviewed "Modulation of flight and feeding behaviours requires presynaptic IP3Rs in dopaminergic Neurons" by Sharma and Hasan. The authors first translated to Drosophila a dominant negative (DN) strategy first tested in mammalian cells to block the function of the fly IP3 receptor. Controls using westerns to test the expression in vivo and calcium imaging to assess inhibitory activity in an ex vivo prep were generally convincing. They then show that the DN, RNAi and a wt transgene disrupt flight, as they have shown previously using both genetic mutants and RNAi. They use genetic rescue to further show that alterations in the function of itpr in dopaminergic cells are likely to mediate at least some aspects of the flight deficit. The restricted distribution of the THD' driver was used to narrow down the identity of DA cell clusters responsible for this effect to PPL1 and/or PPL3. Additional split GAL4 lines identified a deficit when the DN was expressed in the PPL1-γ2α′1 subset of DA cells that project to the mushroom bodies. This is a key finding of the paper since it localizes the requirement of the IP3R to cells that have been implicated in other behaviors. Developmental tests using TARGET/GAL80 indicate a requirement for itpr during late development. Disruption of itpr only in the adult did not have a significant effect. This seems likely to be due to perdurance of itpr as suggested by the authors. However, these data make it difficult to determine which aspects of the phenotype are due to broad developmental deficits versus disruption of IP3R in the adult (see below). The authors next test the effects of mAChR with the idea that mAChR is likely to signal through IP3R. While it was known that developmental expression of mAChR is required for adult flight, the current data show more specifically that the PPL1-γ2α′1 DANs are required, enhancing the impact of the paper.

      To tie these results to vesicle recycling and release the authors use the shibire[ts] transgene in PPL1-γ2α′1. Flight bouts were disrupted via exposure to the non-permissive temperature both during late pupal development and the adult. The adult phenotype has been demonstrated previously but the developmental defect is novel. The demonstration of an effect in adults is important since it suggests loss of itpr during adulthood might also have an effect in adults even though this can't be tested due to perdurance. Expression of shibire[ts] in PPL1-γ2α′1 also disrupts feeding, and the authors next phenocopy these effects with the itpr DN, indicating that IP3R expression in PPL1-γ2α′1 is required for both feeding and flight. However, here as with the flight experiments, it is not possible to directly demonstrate an effect in adults due to perdurance. They show that knockdown of mAChR also reduces feeding similar to its effects on flight and suggest that the deficits are due to disruption of the mAChR ->(Gq) ->IP3R pathway. The suggestion of connections between mAChR and IP3R within PPL1-γ2α′1 and the idea that PPL1-γ2α′1 controls two distinct behaviors are a significant finding and one of the main contributions of the paper.

      To help link the shibire[ts] data set with the results of perturbing mAChR and IP3R, the authors show that carbachol-induced DA release is reduced, making excellent use of the relatively new GRAB-DA lines. As a control, they show that synapse density of PPL1-γ2α′1 in the γ2α′1 MB lobes is not altered. The demonstration that DA release is altered elevates the technical strength of the paper. Moreover, although further experiments might be needed to prove their model, these data support the argument that the mAChR -> (Gq) -> IP3R pathway is disrupted in the adult. The final set of experiments in Fig 6 indicates that excitability of the PPL1-γ2α′1 DANs is also disrupted by knock down of IP3R. It is possible that this deficit contributes to the decrease in DA release by the mAChR -> (Gq) -> IP3R pathway, and the authors nicely explain a possible mechanism and cite relevant references in the Discussion.

      The results of the individual experiments reported by the authors are convincing. The approach is rigorous and they take full advantage of the many powerful molecular genetic tools available in Drosophila. The generation of the DN transgene is a nice idea and in combination with other tools helped them to identify specific subsets of DA neurons important for the behaviors they test. However, they have previously demonstrated similar effects with mutants and RNAi, and again use them to help map the relevant cells. Since the use of the DN construct did not really go beyond the experiments using RNAi or genetic rescue, the emphasis on the importance of this reagent might be reduced in the abstract and introduction.

      Flight deficits have also been seen in other experiments on the DANs identified by the authors. Thus, the major novel finding of this section is the demonstration that itpr is required in these cells for regulating flight. While it was previously shown that feeding behavior also requires DAN projections to the MB, the idea that overlapping cells might control both flight and feeding is interesting. Although the idea that these two phenotypes are specifically related to each other seems somewhat speculative, one major strength of the paper lies in tying together prior observations on itpr and the DANs with their current experiments. They do this again at the cellular level using GRAB to show that carbachol-induced release of DA (but not synapse density) is reduced by itpr knock-down, thus tying together data on shibire, mAChR and itpr.

      These connections make for an exciting story, and they have been cleverly woven together by the authors. On the other hand, they also represent a possible concern about the manuscript as a whole, since causal relationships among the effects of blocking IP3R and mAChR, neuronal excitability and vesicle release are not yet proven. It is therefore possible that all of these are relatively non-specific effects of disrupting the function of PPL1-γ2α′1 neurons. This modestly reduces the strength of the paper but is also a relatively minor concern. A second potential concern is that despite the interesting connections made by the authors as well as some exciting new data, some of the findings replicate previous data.

      It is indeed likely that loss of the IP3R in PPL1-y2alpha'1 DANs leads to both specific (acetylcholine signaling followed by neurotransmitter release) and non-specific changes (such as loss of excitability). Both are likely to have an effect on the behavioural phenotypes modulated by PPL1-y2alpha'1 DANs. We have previously shown a role for both mAChR and the IP3R in flight. However, in this work we have addressed cell specificity and mechanism, neither of which was known earlier.

      A third concern is the relationship between the effects of disrupting PPL1-γ2α′1 during development versus the adult. As the authors suggest, perdurance (of protein expression) and/or "perdurance" of previously formed tetramers could easily account for the failure of itpr and mAChR knock down in the adult to cause behavioral deficits. By the same token, it is difficult to parse out the contribution of developmental defects in the DA cells versus problems with signaling in the adult, and the following issues should be addressed. The observation that synaptic bouton density is not disrupted is a good way to eliminate gross disruption of connectivity during development but does not rule out other more subtle developmental defects in neuronal function. The fact that shibire[ts] can cause effects in the adult is appreciated but does not really help us to understand what IP3R and perhaps mAChR are doing during development.

      We agree and have tried to further address this issue in the text (see above).

      Additional Minor Concerns.

      To validate the decrease in the overall response to carbachol in Fig 1D and E, the authors show a statistically significant difference for area under the curve. A parallel metric and statistical test might be used to support the statement that the response is delayed in 1D but not 1E.

      Thank you for this suggestion. We performed the test and in fact found that both cellular and mitochondrial responses are delayed in the presence of IP3RDN. This part of the text has been modified (page 4).

      "Interestingly, the mitochondrial response did not exhibit a delay in reaching peak values." Why is that? A brief explanation might be useful.

      This is no longer the case. The sentence has been removed.

      The second explanation of how shibire[ts] works might be shortened.

      Done.

      Reviewer #3:

      General Assessment:

      This study demonstrates that IP3R signaling (triggered by muscarinic receptor activation) affects excitability and quantal content of a subset of dopaminergic neurons to modulate flight duration and food search. I had no technical concerns and am generally supportive. My only major concern was that the narrative was fragmented. I believe this is because the perspective shifted between the IP3Rs and the dopamine neurons themselves, and was too focused. I think that streamlining the narrative and providing a broader perspective for the results will remedy this issue.

      Major Comments:

      -I would like the authors to expand upon their final section of the discussion to discuss more about 1) the potential context for cholinergic modulation of the PPL1-y2alpha'1 DANs, 2) the proposed role of these DANs (which have been studied in several contexts) and 3) modulation of innate behavior in general. The paper begins with the importance of modulating innate behavior, but the discussion on this topic is spare and focused almost entirely on research on the mushroom bodies of Drosophila. The discussion section leans heavily on summarizing the results, rather than making connections to work in other systems or networks.

      We have expanded the last section of the discussion to include these suggestions (see above under consolidated review points).

      -The developmental section seemed somewhat tangential as the authors cannot distinguish between a developmental role for the IP3R from a need to express the ItprDN transgene prior to adulthood to overcome a potential slow turnover of endogenous IP3R. In essence, it was unclear how these results contributed to the overall narrative of state modulation of behavior. Is this section informative to the development of the mushroom bodies or rigorous validation of the novel transgene?

      The manuscript addresses how IP3R function impacts behaviour. In that context pupal (developmental) and adult contributions are both relevant.

    1. Author Response

      We thank the editors and reviewers for taking the time to assess our paper. We note that the reviewers seemed generally supportive of the paper, including noting that the paper addressed important questions. For context, we reiterate here our main findings:

      • A prefrontal cortex population encodes the past and the present in its joint activity, but solves the interference problem by encoding all features on independent axes for their past and their present.
      • This encoding would in principle allow upstream regions to independently access representations of the past and present in mPfC populations. We go on to show this happens: we show that only the encoding of the present, and not the past, is reactivated in sleep after training.

      In this context, the main editorial objection that we “did not control for potential confounding of behavioral variables” is not explained in the reviews; we also note that there were no “concerns about the analytical methods used” that were pertinent to our main findings. We are thus unclear about the basis for rejection.

      We respond below to the main points of each reviewer; their suggestions on terminology and of separating literature citations on rodent and primate PfC are being given due consideration.

      Reviewer #1:

      Maggi and Humphries examined how the coding of the present and past choices in the medial prefrontal cortex (mPFC) of the rats during a Y-maze task overlaps and whether they can be reliably distinguished. They found that the neural signals related to the animal's choice in the present and past are distinct and as a result they can be recalled separately, for example, during post-training sleep. Although these are very important questions and an interesting set of analyses have been applied, the results in this report are not entirely convincing, because the analyses did not successfully exclude some alternative hypotheses.

      1) The authors analyzed the signals related to the choice, light cue, and outcome separately, and this is possible because the relationship between the animal's choices and cues were decoupled by testing the animals under at least two different rules. There were a total of 4 alternative rules and different sessions included different subsets of these rules. It is possible that at least some results reported in this paper might vary depending on which of these results were tested. For example, rules might affect how the animals learned the task. Therefore, the authors should provide more detailed information about how often different rules were used to collect the neural data reported in this paper, and whether any of the results change according to the rules used in a given session.

      In the paper we did examine mPfC encoding in the trials under the two qualitatively distinct types of rule (direction-based i.e. egocentric, and cue-based i.e. allocentric), and showed that encoding of the direction, light, and outcome occurred in both rule types (figure 1e). We gave the number of sessions for those rules in the legend for Figure 1e. (We could equally decode all 3 features in direction-based and cue-based rule sessions in the inter-trial interval as well, see Maggi et al 2018, Figure 9). Thus we compared the decoding vectors across all rule-types.

      Only 8 sessions contained more than 1 rule, in the sessions in which the rule was switched. In the full analysis underlying this paper, we had also separately examined the decoding in these 8 rule-switch sessions, and found equally good decoding of direction, choice, and cue. As the paper was already dense - see e.g. Reviewer 3’s comments - we elected to not show this null result in the current version of the manuscript - it is available in version 1 of this preprint - but it can be restored if desired.

      2) The authors claim that the neural coding identified in this study does not depend on the signals in individual neurons by showing comparable results after removing the neurons with significant modulations. This logic is flawed, because the neurons without "significant" modulations might still include meaningful signals due to type II errors. Furthermore, if individual neurons carry absolutely no signals, how can a population of neurons still encode any signals? This might suggest some kind of joint coding, and the authors should not merely implicate such a possibility without more thorough tests.

      The joint coding of information by a population of neurons is the basis for the whole paper, and is tested extensively: for example, Figure 1 is about establishing that joint coding exists in mPfC. Our point on lines 91-95 was simply to show that the decoding could not be trivially explained by one or two neurons that reliably and strongly differed in the firing rates between different labels (e.g. between left or right choice of direction). To do so, we found sessions in which there were neurons with significantly detectable tuning to the task feature, omitted those sessions, and then looked at the performance of the feature decoding in the remaining sessions - and found it was just as good. Indeed, our point is precisely that it is possible for individual neurons to carry no signals detectable by classic significance testing (potentially due to Type II errors), yet for the population to be able to perfectly encode the information.

      The explanation is simply that most, and sometimes all, individual neurons do not consistently covary their firing with the changes in a feature (e.g. choose left and choose right trials) across every trial of a session. In other words, no neuron need consistently participate in encoding information. But so long as when a neuron does change its firing it does consistently vary with the feature, then across a population there are enough intermittently participating neurons on a given trial to always decode the information.
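
      The intermittent-participation argument above can be illustrated with a toy simulation. This is a minimal sketch and not the analysis used in the paper: the population size, participation probability, effect size and the mean-difference decoder are all hypothetical choices made for illustration. Each simulated neuron shifts its rate with the label on only a small fraction of trials, so each neuron alone decodes close to chance, while a linear readout of the whole population does much better.

```python
import numpy as np

def simulate_intermittent_population(n_neurons=400, n_trials=600,
                                     p_participate=0.05, effect=1.5, seed=0):
    """Simulate rates where each neuron is tuned on only a few trials (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1, 1], size=n_trials)          # e.g. left vs right choice
    preferred = rng.choice([-1, 1], size=n_neurons)      # sign of each neuron's tuning
    participates = rng.random((n_trials, n_neurons)) < p_participate
    noise = rng.standard_normal((n_trials, n_neurons))
    # A neuron shifts its rate with the label only on trials where it participates.
    rates = noise + effect * participates * labels[:, None] * preferred[None, :]
    return rates, labels

def fit_mean_difference_decoder(rates, labels):
    """Linear decoder: weights are the difference of class means, threshold is their midpoint."""
    mean_pos = rates[labels == 1].mean(axis=0)
    mean_neg = rates[labels == -1].mean(axis=0)
    return mean_pos - mean_neg, 0.5 * (mean_pos + mean_neg)

def population_accuracy(rates, labels, weights, threshold):
    """Accuracy of the weighted sum over all neurons."""
    scores = (rates - threshold) @ weights
    return float(np.mean(np.where(scores > 0, 1, -1) == labels))

def single_neuron_accuracies(rates, labels, weights, threshold):
    """Accuracy of each neuron used alone, with the sign learned on the training set."""
    scores = (rates - threshold) * np.sign(weights)
    predicted = np.where(scores > 0, 1, -1)
    return (predicted == labels[:, None]).mean(axis=0)

rates, labels = simulate_intermittent_population()
half = len(labels) // 2                                   # first half trains, second half tests
w, th = fit_mean_difference_decoder(rates[:half], labels[:half])
pop_acc = population_accuracy(rates[half:], labels[half:], w, th)
single_acc = single_neuron_accuracies(rates[half:], labels[half:], w, th)
print(f"population accuracy: {pop_acc:.2f}")
print(f"mean single-neuron accuracy: {single_acc.mean():.2f}")
```

      In this sketch the gap between the two printed numbers comes only from pooling many weak, unreliable contributions; it says nothing about the significance tests applied to the recorded neurons.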

      3) The authors analyzed the activity divided into 5 different epochs, where the position #3 corresponds to a choice point and #5 corresponds to the reward site. Therefore, it is surprising that the reliable outcome signals begin to emerge from the position #3 (i.e., choice point). Is this a false positive?

      No, this replicated a common finding of outcome-predictive signals in prefrontal cortex; e.g. Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).

      Fellows, L. K. Advances in understanding ventromedial prefrontal function: the accountant joins the executive. Neurology 68, 991–995 (2007).

      Sul, J. H., Kim, H., Huh, N., Lee, D. & Jung, M. W. Distinct roles of rodent orbitofrontal and medial prefrontal cortex in decision making. Neuron 66, 449–460 (2010).

      Kaplan, R. et al. The neural representation of prospective choice during spatial planning and decisions. PLoS Biol. 15, e1002588 (2017).

      We will add these references to the next version of the manuscript.

      4) The authors report that there is no retrospective coding, i.e., no coding of the choice in the previous trial. By contrast, during the intertrial interval (while the animal is returning to the start position), the signals related to the "past" choice were still present but different from how this information was coded earlier during the trial. This is not surprising since during the intertrial interval, the animal's movement direction is opposite compared to that during the trial, so this coding change could reflect the animal's sensory environment. Whether the brain encodes the past and present events using different coding schemes or not cannot be tested with such confounding.

      We note that the reviewer’s objection here only relates to the choice of arm direction, whereas we showed independent encoding of all three features: direction, outcome, and cue position. We can thus test how the past and present are differently encoded because we showed they are both encoded in the same set of neurons. We showed at length both here (Figure 2a&c, Supplementary Figure 5a) and in Maggi et al 2018 (Figs 5-6 and accompanying supplementary figures) that we could decode the past events from the population activity during the inter-trial interval. The information of the trial and the inter-trial interval can be decoded from the same neurons, so the question is: how can the same neurons encode both the present and the past?

      One interpretation of the reviewer’s comments is that they are concerned about the possible confounding of movement direction between the trial and the following inter-trial interval. Namely, that the turn directions are guaranteed to be opposite: e.g. a left turn into the left-hand arm on the trial would mean a right-hand turn on the return journey of the inter-trial interval. However, that would mean the feature labels would be exactly complementary, e.g. trial = [L L R L R] and ITI = [R R L R L]. So if the population was encoding the direction choice the same way in both the trial and ITI, then using the trial’s decoder of direction to decode direction choice in the ITI should result in a performance of 1-[proportion of correctly classified trials], meaning the classifier would be significantly below chance (and vice-versa for using the inter-trial interval’s decoder for the trials). However, we find the cross-decoding performs at chance (Fig 2).
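
      The arithmetic behind this argument can be checked with a small example. This is a sketch under stated assumptions: the label sequences and the decoder output below are made up for illustration and are not data from the paper. With two classes and exactly complementary labels, any fixed set of predictions that scores accuracy a against one labelling scores 1 - a against the other.

```python
import numpy as np

def decoding_accuracy(predicted, labels):
    """Fraction of trials on which the predicted label matches the true label."""
    return float(np.mean(np.asarray(predicted) == np.asarray(labels)))

# Hypothetical direction labels for five trials, and the complementary
# turn labels for the inter-trial intervals (ITIs) that follow them.
trial_labels = np.array(["L", "L", "R", "L", "R"])
iti_labels = np.where(trial_labels == "L", "R", "L")

# Hypothetical output of a decoder that reads the same code in both epochs
# and makes one error relative to the trial labels.
decoder_output = np.array(["L", "L", "R", "L", "L"])

acc_trial = decoding_accuracy(decoder_output, trial_labels)  # 4 of 5 correct
acc_iti = decoding_accuracy(decoder_output, iti_labels)      # 1 of 5 correct
print(acc_trial, acc_iti)
```

      A shared code would therefore push cross-decoding well below 0.5 rather than to chance, which is the distinction the response draws.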

      5) The authors tested whether the coding of present and past events is consistent using a transfer (cross-decoding) analysis. However, this is based simply on correlation, and does not exclude the possibility that neurons changing their activity similarly according to (for example) the animal's choice might also change their baseline activity between the two periods (as revealed by the analysis of "population activity" in Figure 3) or might additionally encode different variables. In this case, decoding based on simple correlation might not reveal consistent coding that might be present.

      It is unclear what the referee means by the cross-decoding analysis being “based on simple correlation”. The decoder is trained on vectors of firing rates (cf Figure 1b). The decoder assigns high weights to neurons whose activity differs most strongly between the two labels (e.g. left and right choice of direction). So a change in “baseline”, presumably meaning the average firing rate of a neuron across all trials or all ITIs, would not alter the decoder outcome. In addition to the two cross-decoding tests, we also showed the independent encoding by: (a) the angles formed by the decoding vectors trained solely on the trials and solely on the ITIs (Fig 2d-f); and (b) the independence of the population rate vectors between trials and ITIs (Fig 3). Indeed, the change in population rates between trials and ITIs shown in Figure 3 is exactly that predicted by the cross-decoding results, as explained on pg 7.
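
      The angle comparison mentioned in (a) reduces to a standard computation on two weight vectors. The sketch below is illustrative only: the function name and the example vectors are hypothetical, and the paper's exact procedure is not reproduced here. An angle near 90 degrees corresponds to orthogonal (independent) decoding axes, while an angle near 0 degrees corresponds to a shared axis.

```python
import numpy as np

def angle_between_degrees(w_a, w_b):
    """Angle in degrees between two decoder weight vectors of equal length."""
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    cosine = np.dot(w_a, w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b))
    # Clip guards against rounding errors pushing the cosine just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

# Hypothetical weight vectors for a three-neuron population.
w_trial = [1.0, 0.0, 0.0]
w_iti = [0.0, 2.0, 0.0]
print(angle_between_degrees(w_trial, w_iti))    # orthogonal axes
print(angle_between_degrees(w_trial, w_trial))  # identical axes
```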

      Reviewer #2:

      The study by Maggi and Humphries re-examines data by Peyrache et al. (2009), which the authors have themselves analysed previously (Maggi et al., 2018), recorded in rat prelimbic/infralimbic cortex (see comment below on terminology). In particular, they look at the relationship between decoding of task events during performance of a trial, and during the subsequent intertrial interval. (n.b. in this study, unlike in many studies, the ITI is considerably longer than the trial period). They find that although task-relevant information can be decoded during these two periods, the information is encoded in orthogonal subspaces during trials ('the present') and ITIs ('the past'). They build on this to examine how information is encoded during sleep following training (vs a pre-training control period). They find that only the trial subspaces are reactivated during sleep, not the ITI subspaces, and more so if the rat received a higher rate of average reward.

      On the whole, I found this an interesting paper with a clear set of findings, and well-analysed data. Although the advance is in some ways an incremental one on previous studies of sleep/replay, and on the authors' previous analyses of this dataset, the study will undoubtedly be of interest to researchers who are interested in consolidation of past experience during sleep. In particular, the study benefits from being able to look for two different types of information ('past' and 'present' decoders) in the same sleep recording sessions. There were a few things that I felt the authors could address:

      1) For the cross-decoding analysis in figure 2 b, it is not entirely clear from the main text which part of the trial and ITI coding is being used here. It seems to me like a more useful way of showing the cross-decoding analysis would be to show the 10x10 matrix of cross decoding accuracy for each of the 5 maze positions in both trials and ITIs. This is, I think, different from what the analysis in figure 3g is trying to show (which plots the classification error after dimensionality reduction to a 2D space).

      As we strived to explain in the text, for the cross-decoding analysis we used the decoder trained on the firing rates across the entire trial and separately across the entire ITI, in order to arrive at the most stable decoding vectors. We did not show the cross-decoding for the full 5x5 matrix of positions, as the results would be quite noisy. Nevertheless, this is a constructive suggestion, and we will add this analysis. (And indeed the analysis in Figure 3 already shows that the population activity is separable in 1 or 2 dimensions between the trials and ITIs at each maze position, so we would expect the decoder weight vectors to also be independent).

      2) It was surprising to me that the authors do not mention the finding in figure 4e anywhere in the abstract or introduction. It makes the reactivation story far more compelling if it can be linked to a change in behaviour during the preceding trials. I think this finding would benefit from not being buried deep in the results section.

      We are happy to make this result clearer. Our main finding is of the independent coding, and this result in Fig 4e does not speak directly to the independent coding results, but rather is a lovely little result to support the hypothesis that there really is reactivation of the population vectors in sleep. Because it did not speak to the main thrust of the paper, it was omitted from the abstract given the constraints on the number of words (150).

      3) The finding in figure 5 seems slightly extraordinary. It suggests that reactivation decoding during sleep is reliable even if very long bins of activity are used to calculate the firing rate (e.g. up to 10s). Does this relationship ever break down? Presumably with the sleep data, it would be possible to extend bins up to 1 minute, 5 minutes, etc. If there is still more reactivation at these extremely long time-bin lengths, does this mean that these neurons are essentially more persistently active? One possible way to test for this might be to project the data recorded during sleep through the classifier weights, and then calculate the autocorrelation function of this projected data (e.g. Murray et al., Nat Neuro 2014). If this activity becomes more persistent, the shape of the ACF may change post-training.

      An excellent question. Rather than persistent activity, we interpreted the consistency of reactivation across orders of magnitude time-scales as showing that the correlations between the neurons were roughly consistent; and thus when active tended to be active in roughly the same relative order. Support for this comes from the findings in Appendix Fig A4e - the correlation matrix between neurons in the trial was more consistently found in post than pre-session sleep.
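
      For reference, the check suggested by the reviewer (projecting binned sleep activity through the classifier weights and computing the autocorrelation function of the projection) could be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the array layout (bins by neurons) and the toy input are hypothetical, the normalisation is a simple biased estimator, and this analysis was not performed in the paper.

```python
import numpy as np

def project_through_decoder(binned_rates, weights):
    """Project binned activity (n_bins x n_neurons) onto a decoder weight vector."""
    return np.asarray(binned_rates, dtype=float) @ np.asarray(weights, dtype=float)

def autocorrelation(signal, max_lag):
    """Autocorrelation of a 1-D signal for lags 0..max_lag, normalised so lag 0 equals 1.

    Assumes the signal is not constant (otherwise the normaliser is zero).
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - lag], x[lag:]) / denom
                     for lag in range(max_lag + 1)])

# Toy input: two neurons, eight bins, with an alternating projected signal.
binned = np.array([[1.0, 0.0], [0.0, 1.0]] * 4)
projection = project_through_decoder(binned, [1.0, -1.0])
acf = autocorrelation(projection, max_lag=2)
print(acf)
```

      Comparing such curves between pre- and post-training sleep would show whether the projected activity decays more slowly after training, which is what more persistent activity would predict.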

      Reviewer #3:

      This article asks whether within-trial (present) and ITI (past) task parameters are encoded in mPFC, and how encoding during these two trial epochs is related. They claim that firing in mPFC reflects past and present, but that population encoding of past and present is independent. Further, they show that the present is reactivated during sleep, not the past.

      On the face of it, this seems like an interesting paper. It is novel in that ITI encoding would be highly related to what was going on in the trial. The sleep finding is also interesting but I don't quite get the distinction between present and past for sleep. That could use some clarification.

      1) I'm not an expert in regards to this type of analysis, but throughout I was left with the feeling that I would prefer at least some single neuron data and firing rate analysis to complement the highly computational analysis, which frankly, was difficult to understand or critique by somebody who is not an expert.

      The goal of the paper is to assess the population coding in PfC of the same events in the past and the present. Indeed, as reported in the paper, we found 25-39 sessions which had no single neuron tuning at all to a given event in a trial (such as the choice of maze arm).

      2) I would have liked to see more analysis of firing correlations with behavior. It seems to me if animals were doing different things during the trial and the ITI, then it might not be a surprise that there is independent encoding.

      3) I also wonder if the finding is solely dependent on the task (which is poorly described). It seems like there should be independent coding of past and present in this circumstance because they do not feed into each other, and behavior during one is independent of behavior in the other.

      4) Relatedly, the authors suggest that independent encoding can explain how the brain resolves interference between past and present, but in this task there was no interference between past and present, and the authors do not show that when there is more or less dependent encoding that there is more or less interference. Without this, it is unclear how important this finding is as it relates to performance and general mPFC function.

      We deal with these points together, as they are all on the behaviour in the trial and inter-trial interval in the task. Yes, the behaviour in the trial is independent of that in the inter-trial interval, so there is no “interference” of behaviour. But that is not of relevance to what is encoded in the PfC. The Introduction and Discussion both point out that the problem is interference of the encoding itself: the encoding of the past and present exists, as we show at length, so the question is: how can it co-exist in the same neurons? We indeed ask if there is no “interference” in the encoding simply because activity in the inter-trial interval is just a memory trace of activity in the trial, and rule that out.

      We cannot address when there is “more or less dependent” encoding, because the results are what they are: there is independent encoding of the same events (Figure 2).

      The task is described in detail in the Methods (pgs 20-21).

      5) Could activity reflect what the animal predicts will happen on the next trial, or what they are planning to do? It wasn't clear if that was examined.

      Whether activity in the inter-trial interval predicted what will happen in the next trial was examined in detail in Maggi et al 2018 (Fig 6), and shown here in Figure 2g. We found no encoding of the following trial’s choices, except for a very niche occurrence: an above chance decoding of the next trial’s direction choice when the rat had returned to the start position, during a learning session, and for a direction rule. In other words, as the rat turned to start the next trial, there was decoding of the upcoming choice of arm.

    1. The response is the actual habit you perform, which can take the form of a thought or an action. Whether a response occurs depends on how motivated you are and how much friction is associated with the behavior. If a particular action requires more physical or mental effort than you are willing to expend, then you won’t do it. Your response also depends on your ability. It sounds simple, but a habit can occur only if you are capable of doing it.
    1. Author Response

      We would like to thank eLife editors and the reviewers for their time and effort in reviewing our manuscript, entitled: “Partial prion cross-seeding between fungal and mammalian amyloid signaling motifs” by Bardin et al. We considered carefully their comments and modified our preprint accordingly (new version posted here) and address the remarks and criticism of the reviewers in the response provided below.

      The editors’ summary of the review read as follows:

      Summary

      Bardin and colleagues identify and characterize a third prion system in P. anserina based on a cognate innate immunity signalosome comprised of PNT1/HELLP. The authors demonstrate that the three prion pathways operate orthogonally without cross-seeding; however, the newly identified PNT1/HELLP prion can be cross-seeded by the putatively homologous human necroptosis pathway when it is reconstituted in P. anserina, which further supports an evolutionary relationship between them. The review has identified substantive concerns, which limit the novelty of the work and would require significant new studies to address the mechanistic gaps. These concerns include prior work revealing several major tenets including prion activity for PNT1/HELLP in C. globosum and evolutionary conservation to the mammalian necroptosis pathway and the absence of robust experimental support for cross-seeding, or the absence thereof, membrane disruption as the cause of incompatibility, and for the relationship among toxicity, growth, protein state, and protein interaction. Concerns were also raised about the data presented, or absent, in terms of replicates, frequency of observations, and variability.

      It is our understanding that the editors and reviewers raise two types of concerns. One relates to the novelty of the work. The second type directly questions the experimental soundness of some of the presented results. We will briefly respond to the criticism regarding novelty and in detail to the methodological critique. We show the existence of a third PFD-based cell-death inducing system in Podospora, that human RHIM-motifs form prions in Podospora and that RHIM-prions partially cross-seed with PP-fungal prions. These results are nonetheless novel and do shed light on the biology of Podospora and the relation of fungal and mammalian amyloid signaling motifs. Regarding the second group of concerns, we think that by clarifying certain approaches and by giving experimental results in full detail, we are able to address many of the criticisms. For the remaining points (essentially the question of the HELLP membrane interaction), we amend our preprint to point at the delineation of experimental results and interpretation explicitly. We gratefully acknowledge the editors' and reviewers' input as a means to improve the quality of the preprint and realize in light of some of these comments that the manuscript lacked clarity in places and that detailed results tables (that were summarized in the original preprint for the sake of conciseness) should indeed be included. But having said that, it is our intention to stand our ground regarding the central claims of the paper (as they appeared in the abstract of the preprint).

      Reviewer #1

      Bardin and colleagues identify and characterize a third prion system in P. anserina based on the PNT1/HELLP NLR-based signalosome based on the amyloid signaling motif PP from Chaetomium globosum. The C-terminal domain of HELLP is shown to exist in either soluble or aggregated states based on fluorescence microscopy of tagged protein in vivo, termed the [pi] state, and to form amyloid in vitro. These distinct states can be propagated independently and induce conversion of full-length HELLP upon cytoplasmic mixing, which leads to cell death. The PNT1 N-terminal domain also forms foci in vivo and can seed conversion of HELLP, also leading to cell death. The C-terminal domain of C. globosum HELLP and the RHIM regions of mammalian RIP1 and RIP3, which both contain PP motifs, can cross-seed HELLP conversion to the aggregated state but the other known P. anserina prions [Het-s] and [phi] are unable to do so.

      Support for the model proposed is generally qualitative in nature, with multiple instances of data described but not presented, including the timing of conversion to the aggregated state, revision of the aggregated state in meiotic progeny, the frequencies of conversion and co-localization, and the correlations between growth and prion phenotype. For the data presented, replicates, frequency of observations, and variability are not reported.

      It is unclear to us what is meant by “the model proposed”. It is not our understanding that we are proposing “a model” in this paper. The results that we claim are:

      -There is a third NLR/HELL protein pair involving amyloid signaling in Podospora

      -There is no cross-seeding between HELLP PFD and the two other Podospora PFDs (HET-s, HELLF)

      -RHIM can form a prion in Podospora

      -There is a partial prion cross-seeding between PP PFDs and mammalian RHIM in vivo in Podospora

      These are the statements made in the abstract of the preprint. It is our opinion that these central claims stand in the face of the reviewers' criticism. We shall attempt to provide, whenever possible, quantitative details regarding the points raised.

      Specifically:

      the timing of conversion to the aggregated state

      There are two types of experimental situations here. In certain sets of experiments, spontaneous conversion to the prion state is measured at different subculture durations (5, 11, 19 days of subculture) (as appears in Table 1). When induced conversion (cross-seeding) is assayed, the conversion process is measured at a single time point. Details of the timing of assay of the conversion are given in the material and methods section (and now given in Table 1).

      revision of the aggregated state in meiotic progeny

      Details of the progeny of a specific cross involving curing of the [π] prion are now given. Among 20 meiotic progeny containing the GFP-HELLP(214-271), 3 were cured.

      the frequencies of conversion

      Possibly the statement that the results are “generally qualitative” comes from the fact that several conversion experiments or barrage interaction results were presented in tables with a binary output (+ or -) in the original preprint. This presentation was chosen because the replicates of these experiments yielded only monotonous all-or-none results. All tested strains were either converted (+) or not (-). In all tables, the number of tested strains and the number of replicates per strain are now given (Table S1 to S6). This presentation results in quite boring tables but we think that this should eliminate this ambiguity.

      and co-localization

      For all co-localization experiments, in addition to representative micrographs, counts of independent observations for each phenotype and of co-localizing dots are given in Tables S7 and S8.

      the correlations between growth and prion phenotype.

      As the prion itself has no toxic effect in the absence of HELL or HeLo containing proteins (published results for [Het-s] and [φ], and verified here for [π] and [Rhim]), this last remark appears to apply to RHIM/HELLP co-expression, which results in growth defects. We observe that strains co-expressing RHIM and HELLP are affected in their growth when they are infected with [Rhim] prions. These results are presented in Table 2. We base the conclusion that the growth defect relates to acquisition of the prion phenotype on the fact that the growth defect occurs after contact with a prion-infected strain. This increase in the number of strains with a growth defect requires the presence of the corresponding PFD in the recipient strain. Finally, the same table presents, as a positive control, a similar experiment with homotypic [π]/HELLP interactions.

      In addition, a mechanism is proposed to explain the toxicity associated with HELLP conversion to the aggregated state - membrane localization - but this model is not supported by robust data such as a marker for the membrane in the fluorescence images or a biochemical fractionation. Moreover, the absence of functional data, such as mutations that disrupt amyloid formation, leave the model with correlative observations to support it.

      We agree that we do not prove membrane association for HELLP. Considering the precedent of HET-S, it is however a plausible explanation for the documented cell-death inducing activity. We acknowledge that we do not provide experimental evidence based on biochemical fractionation or dual labeling that HELLP relocates to the membrane (this would probably require confocal microscopy). What we do claim, however, is that in this regard HELLP behaves analogously to HET-S, CgHELLP and HELLF. We have modified the text of the preprint to state specifically that proof of membrane localization would require other approaches (in particular biochemical fractionation).

      The reviewer calls for mutations that disrupt amyloid formation and that should accordingly abolish HELLP toxicity. While this type of experiment is not without interest (this exact type of study has been made in the case of HET-S), we feel that at the present stage the fact that toxicity of HELLP is conditional and occurs specifically in interaction with [π] (not [π*] or other Podospora prions) is sufficient support to justify the suggestion that HELLP functions analogously to HET-S, HELLF and CgHELLP by activation through amyloid templating.

      Finally, observations on the C. globosum system decrease the novelty of the observations.

      We address this comment below (response to substantive concern 1 of reviewer #2).

      Reviewer #2

      This work reports the discovery of an amyloid-based cell death signaling pathway in the filamentous fungus, Podospora anserina. This makes the third such pathway in this fungus. As for the others, the amyloid in this case has prion-like activity, is selectively nucleated by a cognate innate immunity sensor protein, and results in activation of the membrane-disrupting activity of the protein. They show that all three pathways operate orthogonally - that is without cross-seeding. In contrast, cross-seeding did occur between this pathway and the putatively homologous human necroptosis pathway when it is reconstituted in P. anserina, which further supports an evolutionary relationship between them.

      Substantive concerns:

      1) The novelty of this finding is somewhat dampened by this group's prior demonstration of several of the major points of interest in previous papers. They had previously discovered and characterized the homologous pathway in a different fungus, and suggested an evolutionary link between fungal amyloid signalosomes and mammalian necroptosis using strong bioinformatic and structural evidence. In addition, they had shown that the two previously known amyloid signaling pathways in P. anserina operated orthogonally. Hence the major point of novelty, as reflected in the title, is the demonstration that this particular amyloid pathway can cross-seed the human necroptosis amyloids.

      We are honestly puzzled by this comment, which is also shared by reviewer 1. At no place in the preprint do we claim that the discovery of the PP-motif is new; we build on preceding work on CgHELLP and claim novelty on distinct aspects. While arguing about the significance of one’s work is somewhat of a vain enterprise, we shall nonetheless point out the specific interest we see in these results. As part of our long-standing focus on Podospora as a model to study fungal PCD, we consider it of interest to document that this species contains three amyloid-activated HeLo/HELL-domain cell-death execution pathways. Bioinformatic surveys suggest the co-occurrence of several amyloid motifs in different fungal genomes; it is of interest, we think, to document this redundancy at a more functional level in at least one system. The present study is superior to the previous one on CgHELLP in that the activity of the PP-motif proteins is studied in their native context (not in a heterologous host that diverged from C. globosum tens of millions of years ago). Then, to our knowledge, RHIM-motifs have never been shown to behave as prions. There is a non-trivial relation between the concepts of amyloids and prions. The reviewer writes in a later paragraph that amyloids are inherently self-perpetuating, but this does not imply that all amyloids are prions (or vice versa for that matter). Showing that RHIM forms (like PP-motifs) a prion when expressed in Podospora stresses, we feel, the functional similarity between the fungal and animal signaling motifs. The formation of the [Rhim] prions and their propagation in a fungal environment was not a foregone conclusion. It is our experience that not every amyloid sequence will form a prion in Podospora (Aβ, α-syn, etc.) and the reviewer is surely more than aware of the rich literature dealing with the amyloid/prion relation in yeast models.
      The Podospora in vivo system might also be of use to others to study RHIM-assembly, for instance to screen for inhibitors of RHIM-assembly. As stated by the reviewer, the major novelty is the demonstration of cross-seeding between fungal and human necroptosis pathways, a relation which has so far only been suggested on the basis of sequence similarity in a minute motif of 5-10 amino acids in length. We feel that documenting cross-seeding does strengthen the hypothesis that these motifs are evolutionarily related.

      2) Implications of "cross-seeding". The interspecific cross-seeding observed was modest; much lower than that for intraspecific templating between proteins of the same pathway. Specifically, it failed to induce a barrage, the puncta formed at different times, and colocalization was incomplete. More importantly, cross-seeding does not imply functional or evolutionary conservation. Consider the wide range of amyloid proteins that have been reported to cross-seed each other despite in some cases very different sequences, structures, and functions - for example the type-II diabetes peptide IAPP with the Alzheimer's peptide Aβ; the yeast prion protein Rnq1 with human Huntingtin; and the yeast prion Sup35 with human transthyretin. Although a direct comparison with the present data are not possible, these cross-seeding interactions appear comparably robust. The present demonstration of limited cross-seeding therefore seems not to add much additional support for an evolutionary relationship between necroptosis and fungal amyloid cell-death pathways.

      Cross-seeding is partial and not as efficient as in homotypic or intra-kingdom interactions. This is precisely our conclusion (see for instance lines 470 to 473 of the original preprint). We point at this partial effect and state that it suggests both some level of structural similarity and the existence of functionally important structural differences between RHIM and PP-amyloids. These results are in line with the fact that the consensus RHIM and PP-motifs, while sharing some common positions, also markedly differ at others. The specificity of the cross-interaction between [π] and [Rhim] prions is also supported by the absence of cross-reaction between [π] and the other Podospora prions (or between [Rhim] and [Het-s]). The same is true for the partial co-localization. These results serve as a functional context that will allow future structural data on the fold of the PP-motif to be meaningfully compared to the RHIM-structure. To insist on the partial nature of this cross-seeding, underlining both the relation and the differences between PP and RHIM, we propose to modify the title of the manuscript to “Partial prion cross-seeding between fungal and mammalian amyloid signaling motifs”.

      The reviewer states: “More importantly, cross-seeding does not imply functional or evolutionary conservation”. Absolutely so. But when two amyloid-forming regions show sequence similarity (not just composition bias) and both work as functional amyloid signaling motifs leading to necroptotic cell death, then cross-seeding is further support for (not proof of) evolutionary and functional conservation.

      3) Rigor of the fusion experiments. In all cases, despite having generated and validated the use of RFP- and GFP-labeled proteins, all fusion experiments to examine cell death microscopically (using Evans Blue staining) were between two GFP-expressing strains. This is frustrating because it makes it impossible to know from the images alone which of the two proteins is expressed in which cells, and in which cases of mycelia crossing paths is fusion occurring. I must therefore rely entirely on the labels provided, but they sometimes appear implausible. For example, the lower fusion event demarcated in Fig. 3C left panel would have been expected to allow GFP levels to equilibrate across the point of contact; instead there remains a sharp transition in GFP intensity between the two mycelia (third panel) indicating the cytoplasm is not being shared at the time of the image. In Fig. S8 top row, there is no apparent relationship between cell death and HELLP-GFP; moreover, cell death is seen occurring in mycelia containing either punctate or diffuse GFP-RIP3. While I appreciate that Evans Blue fluorescence may overlap with that of RFP (which should be stated) and preclude its visualization without multispectral imaging capabilities that may not be available to the authors, alternative viability stains and fluorescent proteins could in principle have been used to avoid this problem.

      Evans blue fluorescence does indeed overlap with RFP fluorescence, which is the reason why we used GFP-labeled proteins, although this is less convenient for distinguishing strains. Evans blue staining, however, allows clear and rapid identification of dead cells. Even with both strains labelled with GFP, strains can be identified based on diffuse versus dot-like fluorescence. Moreover, the fusions are observed under the microscope in the contact zone between the two strains, where the proportion of dead (stained) cells is drastically increased compared to the rest of the mycelium, and the relative orientation and position of the filaments allow for strain identification. As for the concerns regarding equilibration of GFP levels or HELLP presence in heterokaryotic cells, these observations could be explained by the fact that necroptotic cell death due to the toxic effect of HELLP, as for the other HeLo or HELL domain-containing proteins (Seuring et al. 2012, Mathur et al. 2012, Daskalov et al. 2016, Daskalov et al. 2020), is associated with blocking of the septa, which limits the spreading of cell death through the entire mycelium. Fungal incompatibility is associated both with cell death and with compartmentation of the mycelium.

      We thank the reviewer for bringing to our attention the difficulty of clearly identifying heterokaryotic cells on these images. Therefore, cell death imaging is presented in the new preprint using methylene blue, which allows the use of RFP- and GFP-labeled proteins to identify heterokaryotic cells unequivocally.

      Minor Comments:

      1) The significance of these proteins forming "prions", as opposed to (merely) amyloids, should be articulated. This is important because prion-formation per se is irrelevant to the cell-level functions of the proteins, as nucleation of the amyloid state causes cell death and hence precludes their persistent/heritable propagation. Amyloid by nature is self-perpetuating at the molecular level and hence would seem to explain the properties of the protein. The discussion about possible exaptation of these pathways for allorecognition could be expanded or clarified in this regard.

      These are interesting points. Prion and amyloid are terms with different fields of application. The term prion is only meaningful in vivo. We use it preferentially here because, for the most part, we document prion propagation and only indirectly amyloid formation. We feel however that it might be premature to conclude that the prion behaviour is totally irrelevant to the function of these proteins as signaling devices. This all depends (as for other prions) on the actual balance between toxicity and infectivity. It might well be that HELLP propagates part of the amyloid signal before it actually leads to cell death. Please note that even full-length HET-S can be observed in certain growth conditions in the form of dots and may thus partition between a toxic and an infectious fraction.

      2) Colocalization between two proteins does not imply that one has templated the other to form amyloid, even when both are capable of forming amyloid independently (see https://doi.org/10.1073/pnas.0611158104 ).

      We fully agree. We have corrected the labelling of the figures that document co-localization that were previously labelled as cross-seeding experiments.

      3) Statements of partial cross-seeding are supported by quantitation (Fig. 8). In contrast, the authors appear to use qualitative observations to support rather definitive statements about the "total absence of" (line 344) of cross-seeding between other pathways.

      Quantitative data are now given regarding the experiment presented at line 344. It is true that the statement “total absence of” relates to the absence of detectable cross-seeding in the experimental setting that was used. In this specific case, no prion formation of [Het-s] was detected in a total of 18 × 2 × 3 infection attempts with [Rhim] prion donor strains (18 transformants for each [Rhim]-type in triplicate).
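
      To make concrete what “absence of detectable cross-seeding” can and cannot exclude, the following minimal sketch computes an upper bound on the per-attempt conversion frequency that is still compatible with zero conversions. It assumes, as a simplification that is not part of the response above, that the 18 × 2 × 3 attempts are independent trials with a common conversion probability; replicates of the same transformant are unlikely to be fully independent, so the bound is illustrative only. The function name is hypothetical.

```python
def zero_event_upper_bound(n_attempts: int, confidence: float = 0.95) -> float:
    """One-sided exact binomial upper bound on a per-attempt rate when
    zero events were observed in n_attempts independent attempts."""
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    alpha = 1.0 - confidence
    # Solve (1 - p)^n = alpha for p: the largest rate that would still
    # yield zero events with probability alpha.
    return 1.0 - alpha ** (1.0 / n_attempts)


# 18 transformants x 2 [Rhim] types x 3 replicates, no conversion observed.
n_attempts = 18 * 2 * 3
upper = zero_event_upper_bound(n_attempts)
print(f"{n_attempts} attempts, 0 conversions: rate < {upper:.1%} (95% bound)")
```

      Under these assumptions, the data exclude per-attempt conversion frequencies above a few percent but cannot exclude rarer events.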

      4) Fig. S9. "Note that induction of [Rhim] in transformants leads to growth alteration to varying extent ranging from sublethal phenotype to more or less stunted growth." Can the authors suggest an explanation for this heterogeneity? From my limited perspective, it suggests the existence of amyloid polymorphisms (i.e. a prion strain phenomenon), which is quite unexpected given the lack of polymorphism among known functional amyloids in contrast to rampant polymorphism among pathological amyloids. Hence the phenomenon could be interpreted as suggesting that amyloid is not an evolved/functional state for the PP motif. In any case the phenomenon is interesting and merits further discussion.

      Phenotypic variability in this experiment can be explained by variation in expression levels of the transgene and by prion curing. Transformation occurs through ectopic integration in these experiments (there are no autonomous plasmids available for Podospora). As a consequence, in any given experiment the transformants will display different copy numbers and integration sites of the transgene and hence variability in expression level. An additional cause of variability is “escape” due to counter-selection: when a strain shows self-incompatibility, fungal articles in which the transgene causing incompatibility is mutated or deleted will escape cell death and resume growth. This is very typical of self-incompatible strains and has been extensively documented and used as an experimental tool for mutant selection in Podospora and other filamentous fungi. This phenomenon typically leads to sector formation. Then, in the specific case of experiments involving prion proteins, in addition to these mechanisms leading to genetic variability, “escape” can also occur through prion curing. If a prion causes self-incompatibility, growth recovery occurs through prion curing (this has been extensively studied in the case of the [Het-s]/HET-S interaction). We do not formally exclude the possibility that part of the variability may reflect prion strain formation, but other explanations should probably be considered more likely, as indeed we have no evidence for strain formation for any of the wild-type functional prion motifs we have characterized so far in fungi.

      Reviewer #3

      Three distinct amyloid-based cell-death pathways in fungi have been reported. The authors of the current manuscript extend their previous work of the HELLP/SBP/PNT1 pathway in Chaetomium globosum and describe a similar system in P. anserina. It is shown that the amyloid signaling domain of PTN1 can form a prion in cells deleted of HELLP, which is otherwise activated by the prion to cause cell death. Using this artificial system, the authors test whether the related RHIM motif of the human RIP1 and RIP3 protein can also form a prion in P. anserina and whether RHIM amyloids as well as other fungal amyloid-forming motifs can cross-seed PTN1.

      The experiments are well executed and explained but I have a few suggestions:

      1) Amyloid cross seeding is usually assayed in vitro using purified protein fragments. The artificial genetic system used here is certainly clever but the expression level of different proteins needs to be measured for better comparison of cross-seeding efficiencies.

      We feel that the in vivo system presented here has important advantages; in particular, it is less “artificial” than in vitro seeding in the sense that at least HELLP is in its native cellular context. Note also that the cross-seeding experiments are done with several distinct transformants, which, as explained above, represent different expression levels of the transgene.

      2) Page 16, line 333-334 and Fig 8: How were recipient strains sampled? How random was it? How many samples?

      We thank the reviewer for bringing this to our attention. To address these shortcomings, we added details on sample selection and sample numbers in the Results and Methods sections.

      3) Jargons/abbreviations. Page 19, line 405; Page 20, line 429: What are PAMPs, MAMPs, and PCD?

      These abbreviations have been spelled out.

  5. Sep 2020
    1. Author Response

      We would like to thank the three reviewers for their efforts and the constructive feedback. Below, we describe how we will address the reviewers’ comments in an updated manuscript.

      Summary:

      All of the reviewers expressed concerns about the advance that the work described in the paper represents. These issues were a focus of the consultation among the reviewers. The main concern is that the work needs to go beyond demonstrating that some ganglion cells exhibit nonlinear integration for naturalistic inputs - as that point is quite well established in the literature. The comparison between natural stimuli and gratings could help in this regard, but several issues confound that comparison (e.g. differences in dynamics of the two types of stimuli). These concerns are detailed in the individual reviews below.

      Reviewer #1:

      This paper investigates how retinal ganglion cells integrate inputs across space, with a focus on natural images. Nonlinear spatial integration is a well-studied property of ganglion cells, but it has been largely characterized using grating stimuli. A few studies have extended this to look at spatial integration in the context of natural images, but we certainly lack a comprehensive treatment of that issue. The current paper has a number of strengths - notably using a number of complementary stimuli and analysis tools to study a large population of ganglion cells and linking properties of responses to artificial stimuli with those to natural stimuli. It also has a few weaknesses (some detailed carefully in the paper) - such as the inability to identify ganglion cell types (aside from a few), and to pinpoint specific circuit mechanisms. These are limitations of the techniques used. This is not a request as much as setting the context of the contribution of the paper. Generally the paper was in good shape, and the data supported the conclusions well. I do think there are a number of issues that could be strengthened. Those are listed below in rough order of importance.

      Statistical correlations in natural scenes:

      A number of analyses in the paper rely on estimating the spatial contrast from an image and comparing the dependence of various measures of the cells' responses on spatial contrast. A danger in this analysis is that spatial contrast is likely correlated with many other statistical properties of the image, so attributing a given response property to spatial contrast has some potential confounds. This issue should be discussed as a possible caveat, unless the authors can rule it out. The paper, accurately, describes the results in terms of correlations (and not causal relationships), but some discussion of the complexity of natural image statistics would be helpful.

      Spatial contrast is defined in our work via the variance of pixel intensity inside the receptive field. Indeed, spatial contrast may reflect different aspects of visual scenes, such as object boundaries, textures, or gradients in light intensity. Differences in the effects of these image features on a ganglion cell’s response will not be captured by our analysis. However, the goal of relating spatial contrast to spike count was primarily to analyze whether the spatial structure of light intensity inside the receptive field was related to the response of a given ganglion cell (beyond the mean illumination), and the pixel intensity variance provides a simple, straightforward measure of this spatial structure. To clarify this aspect and better relate it to the complexity of natural images, we will add a corresponding paragraph in the Discussion.
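
      As an illustration of this definition, here is a minimal sketch that computes the mean intensity and the pixel-intensity variance inside a receptive field. The binary receptive-field mask and the function name are assumptions made for this example; the response above does not specify how pixels are selected or weighted in the actual analysis.

```python
import numpy as np


def mean_and_spatial_contrast(image, rf_mask):
    """Return (mean intensity, pixel-intensity variance) inside a receptive field.

    image   : 2D array of pixel intensities
    rf_mask : 2D boolean array, True for pixels inside the receptive field
              (a hard mask is an assumption of this sketch)
    """
    pixels = np.asarray(image, dtype=float)[np.asarray(rf_mask, dtype=bool)]
    if pixels.size == 0:
        raise ValueError("receptive-field mask selects no pixels")
    return float(pixels.mean()), float(pixels.var())


# Two toy 6x6 patches with the same mean intensity inside the receptive field.
rows, cols = np.indices((6, 6))
uniform = np.full((6, 6), 0.5)
checker = ((rows + cols) % 2).astype(float)

mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1:5] = True  # central 4x4 region as the receptive field

mean_u, contrast_u = mean_and_spatial_contrast(uniform, mask)
mean_c, contrast_c = mean_and_spatial_contrast(checker, mask)
print(mean_u, contrast_u)
print(mean_c, contrast_c)
```

      The two patches have the same mean illumination but different variance, which is the distinction the measure is meant to capture. As noted above, it does not distinguish between the image features (edges, textures, gradients) that give rise to that variance.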

      Comparison of grating and natural scene spatial scale:

      The section starting around line 233 was confusing for several reasons. First, this section starts by measuring the spatial scale associated with the grating responses, and then comparing that to LN model performance for natural inputs. It's not clear why the spatial scale is the relevant aspect of the responses to gratings. Indeed, the next paragraph provides a measure of the relative sensitivity of the nonlinear and linear response components (via a comparison of F1 and F2 responses). It would be helpful to include some initial text to motivate the different measures of the grating responses and to anticipate that you will look at both spatial scale and sensitivity.

      A related issue that bears more directly on the scientific conclusions comes up later in the blurring experiments. The issue is whether it is valid to directly compare the apparent spatial scale of nonlinear responses to images (estimated via blurring) with that of the grating responses. Natural images should have much higher power at low spatial frequencies, and this may strongly impact the spatial scale identified with the blurring experiments.

      We agree that the writing may not have been entirely clear, and we will reorganize the material to discuss the extracted spatial scale and nonlinearity index in parallel, as suggested. Regarding the difference in spatial scales from reversing gratings and blurred natural images: yes, it is also our interpretation that the power at low spatial frequencies plays a key role. Our main point here was to assess whether and to what degree the typical analyses of spatial nonlinearity as measured from reversing gratings translate to natural images despite the differences in spatial and temporal structure of the two stimulus classes. In a revised manuscript, we will make sure to clarify the role of low spatial frequencies earlier.

      Clustering of orientation-selective cells:

      An interesting suggestion in the paper is that the orientation-selective cells can be divided into two groups that differ in their spatial integration properties. Do these groups represent different orientations, as suggested in the text? That seems a simple piece of information to add. Related to this, I would suggest moving Figure S4 into the main text.

      We do not have information about the absolute preferred orientations of the orientation-selective (OS) cells, as we did not keep track of retinal orientation when placing the retinas on the multielectrode array. At this point, we can therefore only rely on indirect analyses of relative preferred orientations between pairs of OS cells in the same retina. These indicate that pairs of two nonlinear OS cells tend to have aligned preferred orientation (and similarly for pairs of linear OFF OS cells), but pairs of a linear and a nonlinear OFF cell tend to have divergent preferred orientations. This is shown in Fig. S4C. For a revised manuscript, we will consider integrating Fig. S4 into the main text, as suggested.

      Presentation of checkerboard stimuli and results:

      The checkerboard analysis, particularly how it isolates properties of spatial integration, could get introduced more thoroughly for a reader unfamiliar with it. A related issue is how well the chosen isoresponse contour captures structure in the full distribution of responses. In some cases that looks pretty good, but in others it is less clear. Could you add a supplementary figure or something similar that characterizes how consistent the isoresponse contours are for different response levels?

      These are good suggestions, and we will aim at clarifying the analysis as proposed and add information about the consistency of iso-response contours for different response levels. In the present analysis, the iso-response contours are used just for illustration, whereas the quantification of rectification and integration of preferred contrast are extracted from specific points in the stimulus-response space, which we found to work robustly for a population analysis without being strongly affected by threshold or saturation effects of the cells. We will explain this more clearly in a revised manuscript.

      Drift in responses over time:

      Some of the rasters - e.g. the bottom left in Figure 1C - show considerable drift over time. It is important that this drift not be interpreted as a failure of the LN model and hence indicative of nonlinear spatial integration. Can you test for drift like this across cells, and exclude any that seem potentially problematic? More generally, some assurance that the variability in the responses for a given generator signal value is real variability across images is needed.

      The presentation of all 300 natural images over ten trials takes about 50 minutes and some drift over this period seems unavoidable. To minimize systematic effects of experimental drift on the measured average responses for different images, we applied randomization within trials, which assured that all images were presented once in random order in each trial before the next trial started. In addition, to quantify the real variability over images of the average response for a given generator signal, we applied a goodness-of-fit measure (CCnorm) that takes into account variability over trials.
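
      The within-trial randomization described above can be sketched as follows. This is a minimal illustration, not the stimulus code used in the experiments; the function name and the fixed seed are assumptions.

```python
import numpy as np


def block_randomized_order(n_images: int, n_trials: int, seed: int = 0) -> np.ndarray:
    """Return an (n_trials, n_images) array of image indices.

    Each row is an independent random permutation, so every image is shown
    exactly once per trial before the next trial starts.
    """
    rng = np.random.default_rng(seed)
    return np.stack([rng.permutation(n_images) for _ in range(n_trials)])


# 300 natural images, ten trials, as in the experiment described above.
order = block_randomized_order(n_images=300, n_trials=10)
print(order.shape)
```

      Because each image appears once in every trial, slow drift over the recording is spread across all images rather than being confounded with image identity.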

      We now also tested directly for the drift mentioned by the reviewer, but observed sizeable effects in only a small subset of cells that were included in the analysis. In most cases, drift corresponded to a global scaling that approximately affected responses to all images proportionally. This is reflected in a high correlation over images between the average responses of the first five and last five trials; 94% of analyzed cells had a correlation coefficient of at least 0.7. Such global scaling of responses does not affect the analysis of differences in average responses. In a revised manuscript, we will provide analyses of drift effects and exclude cells that contain drift effects that appear to deviate from global response scaling.
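
      The drift check described above can be sketched as a split-half correlation. The code below assumes that spike counts for one cell are arranged as a trials × images array with trials in chronological order and columns re-sorted by image identity; the function name and the simulated data are hypothetical, and only the 0.7 criterion is taken from the text.

```python
import numpy as np


def split_half_correlation(counts) -> float:
    """Pearson correlation, over images, between the average response in the
    first half and in the last half of the trials.

    counts : array of shape (n_trials, n_images), trials in chronological order.
    """
    counts = np.asarray(counts, dtype=float)
    half = counts.shape[0] // 2
    early = counts[:half].mean(axis=0)
    late = counts[-half:].mean(axis=0)
    return float(np.corrcoef(early, late)[0, 1])


# Simulated cell whose responses are scaled down globally over ten trials.
rng = np.random.default_rng(1)
tuning = rng.uniform(0.0, 20.0, size=300)   # mean response per image
gain = np.linspace(1.0, 0.5, 10)            # global gain drifting over trials
counts = gain[:, None] * tuning[None, :]    # noise-free, for clarity

r = split_half_correlation(counts)
print(r >= 0.7)  # criterion quoted above
```

      Pure global scaling leaves the correlation across images unchanged, which is why a high split-half correlation is compatible with substantial drift in overall response level.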

      Reviewer #2:

      Summary:

      Understanding how retinal ganglion cells respond to natural stimuli is a central but daunting question, which retinal neurophysiologists have begun to tackle recently. Here Karamanlis and Gollisch perform large-scale multi-electrode recordings in the mouse retina and demonstrate that the responses of many ganglion cells cannot be predicted by standard linear-nonlinear models (L-LN). They go on to test a variety of clever artificial stimuli that emphasize and allow for the quantification of the non-linear aspects of RGCs responses and convincingly demonstrate that non-linear processing is associated with sensitivity to fine spatial contrasts (subunits) and local rectification. While these aspects of RGC receptive fields have been previously described, demonstrating their applicability to natural vision is a significant advancement.

      Major Comments:

      My first main concern is with the way the paper is written. It does not highlight the significant advancements but rather emphasizes what is already known from other studies. For example, many of the conclusions of non-linear spatial integration & signal rectification arising in bipolar cells have been well described previously. By contrast, novel aspects like the sensitivity to reversing gratings being unrelated to LN model performance for natural scenes should be explained in more detail. The authors should more clearly state the major advancements that are being made here beyond what has already been shown previously (e.g. Turner and Rieke, 2016).

      It is possible that our efforts to provide context by relating our results to established findings in retinal signal integration overshadowed the novel aspects of our work. As suggested, we will aim at pointing out these aspects more clearly. For example, compared to the work of Turner and Rieke (2016), we a) focused on a different species with more diversity in accessible RGC types, b) generalized the connection of spatial integration and natural scene encoding to a wider range of cell types (e.g. including also spatially linear and nonlinear ON-OFF cells as well as cells that are inversely sensitive to spatial contrast), and c) developed methods to assess and quantitatively characterize subunit nonlinearities with multielectrode recordings of many cells in parallel, without the need for intracellular recordings or knowledge of the receptive field location.

      Second, the authors never include non-linear subunits in their model to demonstrate improved performance. Testing models with filters that incorporate rectification and convexity as experimentally determined will enable them to show their utility more convincingly. Without this, the reader is left with the conclusion that there are RGCs that exhibit non-linear or linear spatial integration (already known) and that non-linear integrators cause LN models to perform poorly with natural images (Turner and Rieke, 2016).

      The aim of the present work was to assess how well models with linear receptive fields account for responses to natural images in various cells of the mouse retina and whether the models’ shortcomings can be related to the cells’ spatial stimulus integration characteristics. While we agree that models with nonlinear subunits could help support the conclusions, fitting such models to recorded data is – we believe – beyond the scope of the current manuscript. The many parameters of nonlinear subunit models, such as the number, shape, and layout of subunits or their nonlinearity and weight, all likely vary considerably across the diverse population of cells in our recordings. To avoid extensive parameter fitting, simplified models with ad hoc selection of subunit layouts and nonlinearities could help assess whether spatial nonlinearities are important, as in the work by Turner and Rieke (2016). Instead, as an alternative, we chose to analyze the importance of spatial nonlinearities via the effect of spatial contrast in images with similar mean intensity in the receptive field (e.g. Fig. 2). For our data, an advantage of this approach is that it is directly applicable to cell types with diverse spatial integration characteristics, such as the cells that are inversely sensitive to spatial contrast, which wouldn’t be captured by a standard subunit model with rectifying subunit nonlinearities. In future work, however, we plan to analyze subunit models that can account for the diversity of observed response patterns.

      Third, I'm not sure how 'natural' their natural images are, given static images are flashed over the cell intermittently. While such stimuli might simulate some sort of saccadic eye movements, whether this is relevant for mouse vision is not clear. Would linear models be more predictive for responses to natural movies? Some discussion on this issue would be helpful.

      Rather than aiming for fully natural movie-like stimuli, we used flashed images in our work to focus on aspects of spatial integration. This indeed entails a simplification of the temporal structure of natural stimuli, which was intended, but it preserves natural spatial structure, such as the occurrence of objects, boundaries, textures, and intensity gradients, as well as continuously decreasing power for higher spatial frequencies. Nonlinear spatial integration in the presence of this natural spatial structure will likely also shape responses under natural movies. To clarify this approach, we will re-evaluate our wording regarding the application of natural stimuli in our work and discuss the simplification compared to natural movies, as suggested.

      Reviewer #3:

      The manuscript by Karamanlis and Gollisch examines the responses of mouse retinal ganglion cells (RGCs) to natural stimuli. The primary conclusion of the manuscript is that spatial integration of stimuli within the receptive field is nonlinear. This nonlinear integration is consistent with "local signal rectification". This results in a set of RGCs that are sensitive to spatial contrast within the RF. The Authors also note the presence of cells that are suppressed by contrast and cells that prefer uniform stimulation of the RF. To reach these conclusions the authors use multi-electrode array recordings from isolated mouse retina. Spatial RFs are estimated using white noise stimuli, which are then used to generate a null-model for linear spatial summation. They compare predictions of this null-model to the responses of the same RGCs to briefly flashed natural images. The authors find some RGCs that are consistent with this null model and many that are not consistent. The authors correlate deviations from linear spatial summation to deviations revealed by contrast reversing gratings. They also used a mixed-contrast, flashed-checkerboard paradigm to map the contrast tuning and rectification of RF subunits. Finally, the authors show that some of these results track with functionally distinct RGC types such as direction-selective and "IRS" RGCs.

      The data and analyses presented in this manuscript are high quality. However, I think the study is largely consistent with many previous studies that demonstrate nonlinear spatial integration among RGCs in the mammalian (including mouse) retina. I think the Authors view the use of natural stimuli as a major departure from previous work, but I'm not convinced of this for two reasons. First, I don't see a compelling reason to think that results using contrast reversing gratings or other 'textured stimuli' (e.g. Schwartz et al Nat Neuro 2012) would fail to generalize to flashed natural scenes. Second, the implicit claim here is that a 200ms flashed natural scene interleaved with an 800ms gray screen is a natural stimulus. I think this assumes a lot about the space-time separability of the RF mechanisms, and these assumptions are not well justified.

      Major Concerns:

      1) I think the introduction of the manuscript is building a straw man argument, suggesting that many (or most) scientists think the retina is predominantly linear. A pubmed search of 'retinal ganglion cell' and 'nonlinear' produced more than 300 studies. Specifying subunit nonlinearity produces 28 studies. The discovery of subunit nonlinearities is roughly 50 years old and many manuscripts demonstrate Y-like receptive fields are more common across RGC types than X-like receptive fields.

      The goal of our work was not to show that receptive fields of mouse retinal ganglion cell are (often) spatially nonlinear, but to test whether these nonlinearities matter for natural images. It is conceivable that spatial nonlinearities as measured with typical artificial stimuli such as spatial gratings or spatiotemporal white noise are not (as) relevant for natural images because the simultaneous occurrence of strong positive and negative contrast inside a receptive field is much rarer in natural images. Indeed, in our work we find that traditional measurements of spatial nonlinearities with reversing gratings do not provide a robust quantitative prediction of whether spatial nonlinearities matter under natural images for a given ganglion cell. As laid out in the Introduction, there is surprisingly little research yet on how spatial nonlinearities affect the encoding of natural images, and in a revised version of the manuscript, we will aim at clarifying that this is the focus of our work here.

      2) The authors seem to be arguing that the spatial nonlinearities engaged by the contrast reversing gratings are not the same as those engaged by their natural scenes (Figure 3). However, I think the authors are assuming too much that the spatial and temporal components of the RFs are separable. The flashed natural scenes are interleaved with relatively long gray screens. The contrast reversing gratings are reversed in a square-wave fashion with no interleaved gray screen. These distinct spatiotemporal dynamics in the stimuli seem likely to explain the difference. This would also seem likely to explain why the flashed checkerboards in Figure 4 produced results more correlated to flashed scenes in Figure 1. In summary, I don't see a strong reason to think the authors are observing anything other than subunit rectification of the sort described by Hochstein and Shapley in the 1970s and followed up in many subsequent studies.

      We do not think that spatial nonlinearities as observed with reversing gratings or with natural stimuli are related to different mechanisms. The point of our analysis was rather to assess whether typical assessments of spatial nonlinearities with reversing gratings allow quantitative predictions about the relevance of spatial nonlinearities under flashed natural images, and we find that this is often not the case. We believe that this is largely due to the differences in spatial structure, in particular, the prevalence of high-contrast edges in the gratings. Yet, indeed, differences in temporal stimulus structure might also contribute. We actually tested flash-like presentations of gratings in some of our recordings, and results were quite similar to those obtained with contrast-reversing gratings and led to the same conclusions. We will describe this in the revised manuscript for clarification.

      3) It is not clear to this reviewer that flashed natural images interleaved by a gray screen are qualitatively more natural than white noise, sinusoidal gratings, or square-wave gratings.

      The spatial structure of natural images is the focus of the present work. It is in this aspect that flashed photographs are more natural than typical artificial stimuli like spatiotemporal white noise or gratings. In particular, natural images contain a broad spectrum of spatial frequencies with relatively more power at lower frequencies, and they combine occasional edges with intensity gradients and textures. Gratings, for example, are characterized by high power at high spatial frequencies, that is, high spatial contrast, which is well suited for triggering effects of spatial nonlinearities but occurs much more rarely in natural images. Thus, understanding whether spatial nonlinearities are important in a natural setting requires considering stimuli that match the natural spatial structure. It seems likely that nonlinear spatial integration observed under flashed presentation of natural images remains relevant when stimuli are supplemented with natural temporal structure, even though the latter may trigger additional effects that shape the responses (e.g. adaptation or nonlinear temporal integration).

      4) The null-model constructed by the authors in Figure 1 assumes the RF follows a specific functional form (e.g. Gaussian). However, many studies show that individual RFs frequently exhibit strong deviations from a Gaussian RF. To what extent are the deviations from the null model produced by deviations from linear summation or just linear mechanisms that deviate from the specific parametric form imposed by the model?

      Measuring the detailed structure of receptive fields (RFs) with high precision from time-limited experiments is a challenge, and using a fitted (elliptical) Gaussian profile is a standard procedure for limiting the effect of noise in the RF structure. We also tried using the pixel-wise spatial profile obtained from the reverse-correlation analysis as a spatial filter, but results were similar, yet often more noisy. We therefore settled on the standard procedure of using a Gaussian fit to the RF. Deviations from the Gaussian profile can indeed contribute to deviations of the model. Yet, for natural images, which have most of their power in low spatial frequencies, these deviations are likely to be small. Furthermore, our subsequent analyses show that the Gaussian RF model provides a useful baseline because it allows us to extract the relation between model deviations and image structure. In addition, the results from the model analysis were supported by the findings under presentation of blurred natural images, which did not require any assumptions about the underlying RF model. In a revised manuscript, we will point out that relying on Gaussian RFs is a choice that we make and that deviations of the receptive field structure may contribute to decreased model performance, but that the subsequent analyses support the usefulness of the applied Gaussian RF model.
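
      As a rough illustration of a linear receptive-field baseline of the kind discussed above, the sketch below weights image pixels with a normalized elliptical Gaussian and sums them into a single generator signal. It is not the authors' fitting or analysis code: the function names, the rotation-based parameterization, the grid size, and the checkerboard test pattern are all assumptions chosen for the example.

```python
import math

def gaussian_rf(size, cx, cy, sx, sy, theta=0.0):
    """Normalized elliptical Gaussian weights on a size x size pixel grid.

    cx, cy: center in pixels; sx, sy: standard deviations along the two
    principal axes; theta: rotation in radians (hypothetical parameterization).
    """
    c, s = math.cos(theta), math.sin(theta)
    weights = []
    for y in range(size):
        row = []
        for x in range(size):
            dx, dy = x - cx, y - cy
            u = (c * dx + s * dy) / sx
            v = (-s * dx + c * dy) / sy
            row.append(math.exp(-0.5 * (u * u + v * v)))
        weights.append(row)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def generator_signal(rf, image):
    """Linear spatial integration: weighted sum of pixel values."""
    return sum(w * p for w_row, p_row in zip(rf, image) for w, p in zip(w_row, p_row))

size = 21
rf = gaussian_rf(size, cx=10, cy=10, sx=2.0, sy=1.5, theta=0.3)

# A uniform patch and a fine checkerboard with zero mean contrast.
uniform = [[0.5] * size for _ in range(size)]
checker = [[1.0 if (x + y) % 2 == 0 else -1.0 for x in range(size)] for y in range(size)]

g_uniform = generator_signal(rf, uniform)
g_checker = generator_signal(rf, checker)
```

      In this toy setup the uniform patch yields a generator signal equal to its intensity, while the fine checkerboard nearly cancels under the smooth Gaussian weighting. A purely linear receptive field therefore predicts almost no response to such fine spatial contrast, which is why responses that do depend on it point to nonlinear spatial integration.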

      5) It was unclear how the authors rule out the contribution of differences in (nonlinear) temporal integration to the effects in this study. In general, RGC RFs are not space-time separable, and it seems that the analyses in the manuscript assume they are.

      Our choice of using flashed images as stimuli with no temporal structure beyond onset and offset and assessing responses via elicited spike counts was motivated by focusing on spatial stimulus integration and minimizing effects of temporal processing. Nonetheless, our extraction of receptive fields from measurements under spatiotemporal white-noise stimulation uses a space-time separation of the spike-triggered average. Thus, the lack of space-time separability of ganglion cell receptive fields can contribute to the putative underestimation of surround components, which we have discussed in the manuscript. In a revised manuscript, we will add an explicit reference to the issue of space-time separability.

      6) This study overlaps significantly with Cao, Merwine and Grzywacz (2011), 'Dependence of retinal Ganglion cell's responses on local textures of natural scenes', Journal of Vision. This article is not cited here, but in my view, the major conclusions are similar.

      Thank you for pointing us to this paper, which is indeed relevant for our work. Both the Cao et al. paper and our manuscript evaluate the effect of spatial contrast in natural images by relating spatial contrast to response deviations from a linear-RF model, albeit with different methods. An important difference, apart from the different species, is that our work then focuses on relating the identified effects of spatial contrast to functional characterizations of the specific nonlinear operations inside the receptive field (e.g. rectification). Furthermore, we also focus on the diversity of spatial-integration properties between cells and cell types, including the description of spatially linear cells and cells that are inversely sensitive to spatial contrast. In a revised manuscript, we will add a comparison to the methods and results from Cao et al.

      7) In my experience, the strength of subunit rectification can be labile during ex vivo experiments. What controls have the authors performed to ensure the effects they are studying remain stable over the duration of their recordings?

      Experimental rundown could, of course, affect subunit rectification as well as other response aspects, such as overall sensitivity. However, we observed that responses for different repeats of the same natural images were typically quite stable over the course of the hour-long stimulus. As also discussed in the response to Reviewer 1, we have now analyzed how responses to late trials deviated from responses to early trials and found that only a small subset of cells displayed sizeable drift. Furthermore, those cases were mostly affected by a global drift in response size, keeping the relative responses for different images approximately constant. (For 94% of cells, the correlation over images between average responses for the first five and for the last five trials was at least 0.7, approximately on the level of estimated random trial-by-trial variability.) This indicates that the features of stimulus integration did not change substantially over the course of the experiment. In addition, nonlinearities as assessed with our flashed checkerboards were strongly correlated to nonlinearities under natural images, despite the fact that these stimuli were applied 1-2 hours apart. Thus, the strength of subunit rectification appears to be sufficiently stable to allow comparison over different stimuli.

    1. Author Response

      We would like to thank all three reviewers for their great effort and their helpful and detailed comments on our manuscript. The reviewers noted the significance of the novel concept we present here; however, major weaknesses of the manuscript were cited in the comments from each reviewer. The criticisms can be summarized into three major categories: 1) missing key controls and analyses in the HEK293 cell models we used; 2) the HEK293 cell models being the only system used for this study; and 3) some of the evidence supporting the mechanistic conclusion is based on correlations and lacks direct demonstration of causality. We have addressed some of their concerns in the updated version of the manuscript and believe that this has improved it. We would also like to respond briefly to the comments here:

      First of all, we apologize for not including some key controls and analyses in our manuscript. We have now revised Figure 1 and added 5 additional Supplementary Figures to provide those controls and analyses. The mistake was caused in part by our failure to consider the manuscript from the audience's point of view. Our HEK293 cell system has been rigorously validated for studying TyrRS nuclear deficiency at the endogenous level of expression. That evidence was published (Wei et al., 2014, Molecular Cell, PMID: 25284223) and cited in this manuscript. But this clearly was not enough; each new experiment needs to have its independent controls and analyses, which we did perform and confirm but failed to include in the original manuscript. This mistake caused major confusion and a lack of confidence in our conclusions. Those controls and analyses have now been included in the revised manuscript as listed below:

      Supplementary Figure S1 shows that 1) the ΔY/YARS and ΔY/YARS-NLSMut HEK293 cells we generated express TyrRS (WT or NLS mutant) at a level similar to endogenous TyrRS expression in the original, unmodified HEK293 cells; 2) H2O2 treatment stimulates the nuclear translocation of TyrRS; and 3) ΔY/YARS-NLSMut cells are deficient in TyrRS nuclear localization with or without H2O2 treatment.

      Figure 1A is expanded to include nuclear fractionation and Western blot results as controls to show that 1) overall and cytosolic levels of TyrRS (WT or NLS mutant) do not change obviously during H2O2 treatment; and 2) ΔY/YARS-NLSMut cells are deficient in TyrRS nuclear localization with or without H2O2 treatment.

      Supplementary Figure S2 shows equal expression of different transgenes in our experiments (Figure 1C and Figure 2D).

      Supplementary Figure S5 is added to strengthen the evidence that co-factors are required for TyrRS to regulate target gene expression. Because HDAC1 is a shared co-factor for both TRIM28 and the NuRD complex, we used an HDAC1 inhibitor Trichostatin A (TSA) to test if it can affect the transcriptional repressor activity of TyrRS. Indeed, TSA treatment blocks the inhibition effect of overexpressed TyrRS on its target gene transcription.

      Supplementary Figure S6 shows equal expression of WT and E196K TyrRS and the gain-of-function effect of the E196K mutation in suppressing target gene expression and protein synthesis.

      Supplementary Figure S7 shows the quantification analysis of caspase-3 cleavage as detected by Western blot analysis in Figure 5B.

      For the second major criticism, the sole use of the engineered HEK293 cell models in the study, we agree that the main conclusions of this paper need to be confirmed in an additional cell system and ideally with the endogenous TyrRS. In fact, we have generated TyrRS nuclear deficient mice by mutating the NLS of the endogenous YARS gene and, by using the mouse fibroblasts, we have confirmed that protein synthesis is overactivated in TyrRS nuclear deficient cells. Because the study of the mouse model has not been completed and it is a separate in vivo study of nuclear TyrRS with its own objectives, we prefer not to add the mouse fibroblast data to this manuscript but will share these data with the reviewers. However, we would like to point out that the ΔY/YARS and ΔY/YARS-NLSMut HEK293 cell lines are not stable cell lines derived from single clones but instead transient transfections that were selected for in bulk. Therefore, they originated from the same starting cell line and diverged only 1-2 passages before the experiments were performed. Genetic divergence between the NLSMut and the control cell line should therefore be limited. We apologize if that was not clear from the Materials and Methods section.

      For the last major criticism, we acknowledge that some mechanistic aspects of nuclear TyrRS have not been unequivocally demonstrated. For example, it remains to be shown whether the direct binding of TyrRS to its target genes and the interactions of TyrRS with TRIM28 and/or the NuRD complex are responsible for the regulation of target gene expression by endogenous TyrRS in cells, and whether the level of transcriptional regulation of protein synthesis genes by nuclear TyrRS is sufficient and responsible for the observed suppression of cellular protein synthesis activity. While this issue is partially addressed by the new Supplementary Figure S5 (treatment with an inhibitor of HDAC1, the shared co-factor of TRIM28 and the NuRD complex), we acknowledge that these weaknesses are in part due to the use of ectopically expressed TyrRS in the current system and can be addressed in the future by using the mouse fibroblasts mentioned above.

    1. Author Response

      Summary:

      As you will see the reviewers agreed that the premise behind this manuscript is important and timely both in the context of basic auditory science and for informing technology. However, they raised largely consistent concerns about the generalizability of your observations to other auditory stimuli and to more naturalistic listening conditions.

      We appreciate the reviewers’ positive assessment underpinning the significance and timeliness of our present research endeavours. We assume generalizability of our findings to more naturalistic listening conditions because the proposed model framework successfully explained the outcomes of experiments that were conducted under listening conditions differing in reverberation and source stimuli. Those differences, however, only occurred across but not within experiments and thus were not considered in the model explicitly. The set of experiments and relevant cues was chosen such that the investigation of decision strategies for the combination or selection of cues in the context of perceptual externalization could be conducted on a limited but still diverse set of cues. The proposed framework allows the set of cues to be extended easily. For example, in another work (see Li et al., in press), we successfully modelled the impact of situational changes of the amount of reverberation on externalization perception by extending the framework to reverberation-related cues. This further strengthens our assumption that our findings can be generalized. Nevertheless, we understand that more direct evidence for this generalizability would further increase the confidence in the conclusions we draw.

      Reviewer #1:

      I agree with the authors that the question at the basis of this work is timely and important both from the point of view of understanding auditory perception and for informing technology. However I am not convinced that the findings here will necessarily generalize to other stimuli/listening situations.

      I think the biggest limiting factor here is that the primary data on which the modelling is based are drawn from many different studies (which used different stimuli, different tasks, different presentation environments and different equipment). I can see how testing the model on existing data is an important first step, but I would think that a critical next step is to form a set of (contrasting) predictions to be tested on a single stimulus set, within a single group of participants, as a way of confirming model validity. In this experiment I would also avoid using static non-reverberant environments since we know that these factors greatly affect spatial perception.

      We do not follow the reasoning why the above mentioned diversity of experimental paradigms is a limitation. On the contrary, in our opinion, the diversity of the considered experiments demonstrates robustness of our findings for a variety of experimental procedures. We agree that an additional validation experiment would further strengthen our study, but we question its necessity and still believe that the present modelling work is extensive and compelling enough to warrant publication.

      Other comments:

      1) The title greatly overstates the main findings; it should be toned down.

      In the title, we aimed at describing the research topic in general terms accessible to a broad readership. We take your comment as advice to state the main findings instead.

      2) Intro, lines 30-33: this statement is misleading. As written, it appears to claim that temporal aspects of auditory perception are based on short-term regularity, whilst spatial perception is based on long-term effects. This is not correct; see e.g. Ulanovsky 2004.

      Agreed. We will remove the sentence or rephrase it in more general terms because the misleading distinction is actually irrelevant to our study.

      3) As a reader not highly familiar with the auditory spatial processing literature I found the results section very dense and hard to follow. If you are targeting a general audience it is important to clarify concepts, avoid using abbreviations where possible etc.

      Thank you for your advice. We will aim to increase the level of abstraction within the results section.

      4) When discussing the various decision strategies which you tested, consider explaining how they might be implemented by the auditory system, at which stage of processing etc.

      Our study approached the problem from an algorithmic point of view and did not touch upon the more detailed level of neural implementation. While the cue processing has a clear neurophysiological basis in the subcortical layers of the auditory system, we will include some speculation about the involved cortical networks in a revised version of the manuscript.

      5) It is very difficult to evaluate your results without more information about the stimuli and studies from which they were taken. Whilst you do provide references, I think the paper would be much clearer if you provide a more complete description of the stimuli (even in table form; paradigms etc).

      We appreciate your advice and will provide more details about the simulated experiments in a table.

      Reviewer #2:

      The current study compares four decision rules, factoring in seven potential acoustic cues, for predicting perceived sound externalization for single-source binaural sound with stationary interaural cues. Test stimuli included a harmonic vowel complex, noise and speech. Results show that monaural and binaural cues shape externalization. However, how listeners weighted these cues varied across the tested conditions. The authors consider the fact that some of these cues covary acoustically, by additionally testing their model on subsets of two of these cues only. No single externalization cue emerged as a clear predictor for perceived externalization. However, overall, a static cue weighting strategy tended to outperform dynamic cue weighting for predicting externalization.

      Major concerns dampen enthusiasm for the current work.

      1) It is unclear what neural mechanism is being tested. A premise of the current approach is that perceived sound externalization is primarily driven by acoustic cues. However, we know this not to be true. Context matters. As pointed out by the authors (l370-372), when listening to sounds processed with head related transfer functions (HRTFs) over headphones, listeners can externalize sound better when the context of the test room matches the room where HRTFs were recorded (Werner and Klein 2014).

      Sound externalization is an auditory percept and as such primarily driven by acoustic cues. How those cues are used for perceptual inference is certainly context dependent. From the present study, we conclude that the auditory system evaluates deviations from a small set of expected acoustic cues in a fixed weighted (and not selective) manner. We further explain that these expectations, which are represented as templates in the model, must be adaptive to the context. This is well in line with your example of room divergence (Werner and Klein, 2014): listeners are thought to establish expectations about reverberation-related acoustic cues and evaluate incoming sensory information against those expectations with a fixed weighting between cues. If expectations are not met (i.e., acoustic cues deviate from their templates), perceptual externalization degrades.
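
      The distinction drawn above between a fixed weighted combination of cue deviations and a selective strategy can be illustrated with the toy sketch below. It is not the model from the manuscript: the function names, the choice of `max` as the selection rule, and the mapping from error to externalization score are all hypothetical and serve only to make the contrast concrete.

```python
def weighted_sum_rule(cue_errors, weights):
    """Static strategy: the same fixed weights are applied to every cue deviation."""
    return sum(w * e for w, e in zip(weights, cue_errors))

def selection_rule(cue_errors, select=max):
    """Selective strategy: a single cue deviation is picked (here, the largest)."""
    return select(cue_errors)

def externalization_score(error):
    """Hypothetical monotonic mapping: zero deviation gives 1, larger deviations approach 0."""
    return 1.0 / (1.0 + error)

# Deviations of two cues from their templates (made-up numbers).
cue_errors = [0.2, 0.8]
weights = [0.5, 0.5]

combined_error = weighted_sum_rule(cue_errors, weights)
selected_error = selection_rule(cue_errors)

score_combined = externalization_score(combined_error)
score_selected = externalization_score(selected_error)
```

      With these made-up numbers the two strategies disagree: the weighted sum averages the deviations, whereas the selective rule lets the worst cue alone determine the predicted externalization.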

      2) Most external sounds are neither anechoic nor stationary. Therefore, any neural decision metric on externalization must have been shaped by lifelong experience with dynamic, reverberant cues for interpreting externalization. The current work mostly models stationary single source sound that was either anechoic or mildly reverberant, providing pristine spatial cues. I do not follow the authors' point that this would not matter (l498-502): "While the constant reverberation and visual information may or may not have stabilized auditory externalization, they certainly did not prevent the tested signal modifications to be effective within the tested condition. In our study, we thus assumed that such differences in experimental procedures do not modulate our effects of interest." That is an untested assumption.

      Others showed that the type of spectral manipulations we considered remain effective also if reverberation is present (e.g. Hassager et al., 2013) and if listeners are exposed to dynamic cues by moving their heads or the sound source (Brimijoin et al., 2013). We used the above-mentioned argument in order to motivate why we ignored certain differences across studies in the first place and the high explanatory power obtained with the proposed model framework suggests that this simplification was adequate. We agree that the above-mentioned sentence can be easily misunderstood and we will modify it by including the explanation stated here.

      3) Many of the current test stimuli are perceived as ambiguous - providing 50% externalization ratings - and thus do not provide a sensitive test of brain mechanisms of sound externalization.

      The field mostly agrees that auditory externalization is not a binary phenomenon but a matter of degree – we very recently published a review article that discusses this issue in detail (Best et al., 2020). Hence, the experimental outcomes, denoted as externalization scores and ranging from 0 to 1, indicate the degree of externalization that is considered to mediate perceived egocentric distance. The externalization scores do not indicate the level of perceptual ambiguity.

      We will include this explanation in the manuscript in order to prevent further misunderstanding.

      4) Reverberation enhances perceived externalization, but this cannot be predicted by any of the tested decision metrics which only consider stationary monaural or binaural cues.

      True, there are also other cues potentially affecting the degree of auditory externalization. Reverberation-related acoustic cues are one of them. The main purpose of our study was to identify the basic functional mechanism that integrates or selects between various cues – the purpose was not the identification of all possible cues that may affect auditory externalization. Thus, we chose a set of experiments in which the relevant cues can be narrowed down a priori, in particular allowing us to ignore reverberation-related cues.

      For the effect of reverberation-related cues, we point interested readers to another modelling study (Li et al., in press) that we conducted in parallel, in which we applied the framework proposed here to reverberation-related cues as well and obtained good predictions.

      On balance, this reviewer is unconvinced that the current work will generalize to realistic dynamic and reverberant conditions.

      We agree with the reviewer that our study does not address dynamic and variable reverberant conditions. It was limited by design to static conditions with fixed reverberation because we had no reason to believe that the targeted decision strategies applied to combine or select cues would be fundamentally different in more complex conditions.

      S. Werner and F. Klein, "Influence of Context Dependent Quality Parameters on the Perception of Externalization and Direction of an Auditory Event," presented at the AES 55th International Conference: Spatial Audio (2014 Aug.), conference paper 6-4.

      Reviewer #3:

      The manuscript "Decision making in auditory externalization perception" aims to identify cues that create/hinder an auditory externalization percept by using a template-based modeling approach. The approach as well as the findings are very interesting, and the study is thoroughly conducted. However, the manuscript adds little new knowledge to the field. Furthermore, a critical discussion is missing. The authors use a template-based model, but do not discuss the possible problems with such an approach. Particularly as each condition uses another model fit. This potentially allows the model to use cues that the auditory system cannot or does not consider. Nevertheless, the approach can still teach us which cues are potentially important for auditory externalization.

      1) The title seems inappropriate as the main work seems to be on the identification and combination of cues for externalization but not on the decision making.

      In combination with Reviewer #1’s first comment, we understand that the title could have been more specific. We will change the title accordingly.

      2) The model needs a more detailed explanation in the introduction. Otherwise the result section is not understandable without consulting the methods section.

      We will carefully re-evaluate which methodological details are necessary to understand the results section on a more abstract level.

      3) Add a Discussion on template-based models and fitting conditions. The risk of mathematically inspired models is that features are exploited that the auditory system cannot access. A more sophisticated front-end than a gammatone filterbank might reduce this risk. Alternatively, the use of physiologically inspired front-ends as in Scheidiger et al. (2018) might be interesting to consider. Nevertheless, I acknowledge that some of the features used in this study are backed by physiological and psychoacoustical studies.

      We agree with the concern behind the use of efficient functional approximations of the auditory periphery. Interestingly, however, we are very confident that this particular approximation does not provide spurious cues, especially in the context of monaural spectral shapes, because we did cross-validate the effectiveness of those cues with a physiologically more accurate model (Zilany et al., 2014) in previous work (Baumgartner et al., 2016).

      We will incorporate a corresponding explanation in the manuscript.

      4) It is known that the monaural spectral shape is important for externalization, for example from the studies that you have used. Thus, I partly question the novelty of the findings.

      We partly agree. It has also been suggested that interaural spectral cues are important for externalization perception. Further, it is also known that other cues contribute (e.g., reverberation-related cues as already discussed in response to the comments of Reviewer #2). Now, which cues contribute to which degree and how are they integrated? This is the main research question behind our study, with the ultimate goal to better understand the mechanisms of cue integration in the context of a perceptual inference task.

      5) I am not too familiar with template based models but I wonder if there is a problem if you use your models to fit and test with the same datasets?

      Cross-validation (i.e., using separate data sets for fitting/training, validating, and testing) is particularly important for complex models that allow overfitting. Such models can often be very closely fit to comparably small sets of data and thus the goodness of fit is not discriminative between those models. Here, in contrast, we compared the goodness of fit for models that contained a rather small and equal number of model parameters and this goodness of fit did strongly differ across models and was therefore informative for model selection in itself. If we separated the data sets, we would need to jointly assess the differences in initial model fits (to training data) together with the differences in predictive power (for testing data).

      References:

      Baumgartner, R., Majdak, P., & Laback, B. (2016). Modeling the effects of sensorineural hearing loss on sound localization in the median plane. Trends in Hearing, 20, 2331216516662003.

      Best, V., Baumgartner, R., Lavandier, M., Majdak, P., & Kopčo, N. (2020). Sound Externalization: A Review of Recent Research. Trends in Hearing, 24, 2331216520948390.

      Brimijoin, W. O., Boyd, A. W., & Akeroyd, M. A. (2013). The contribution of head movement to the externalization and internalization of sounds. PloS one, 8(12), e83068.

      Li, S., Baumgartner, R., & Peissig, J. (in press). Modeling perceived externalization of a static, lateral sound image. Acta Acustica.

      Zilany, M. S., Bruce, I. C., & Carney, L. H. (2014). Updated parameters and expanded simulation options for a model of the auditory periphery. The Journal of the Acoustical Society of America, 135(1), 283-286.

    1. Author Response

      Reviewer #1:

      This manuscript provides evidence that drug administration during a reconsolidation window does not necessarily prevent memory recall, as has been shown by many groups. The authors attempted to replicate several published experiments and despite demonstrating that the drugs had other effects on the animals' behavior and physiology (e.g. weight gain), no effects on memory were observed.

      The paper is nicely prepared.

      We sincerely thank the reviewer for these kind words and the support to publish our replication efforts.

      Reviewer #2:

      General assessment:

      In this study, Luyten et al. aimed to replicate post-retrieval amnesia of auditory fear memories reported numerous times in the literature. They used a variety of behavioural approaches combined with systemic pharmacological treatments (propranolol, rapamycin, anisomycin, cycloheximide) after reactivation of fear memories. Interestingly, none of the treatments induced a significant decrease of freezing responses during subsequent retrieval tests. Authors strengthened their null results by using Bayesian statistics, confirming the absence of drug-induced amnesia.

      Overall, the study is really interesting. Experiments and analyses are very well designed and bring some important findings to the debated topic of post-retrieval amnesia and its clinical relevance.

      We are grateful that the reviewer appreciates our work and recognizes the general importance of our null findings. We genuinely thank them for the time that they took to evaluate our paper in detail and hope to provide some clarifications in our responses below.

      I have nevertheless several comments for the authors to consider.

      -Despite being very detailed, the authors should clarify and uniformize their Methods section and Supplemental information (e.g. number of CS, contexts used...) to improve the understanding of the different approaches. Similarly, methods for the reinstatement protocol (Exp 2) are missing.

      We understand that the information in the main text is quite dense, but we explicitly chose to focus on the central message here, i.e., that we applied standard procedures that should have allowed us to detect amnestic effects in consideration of most of the published literature. In addition, the crucial overview of the number of training and test trials, as well as the context that was used for each session is depicted in Fig. 1-3, immediately above the results of the respective experiments.

      In the Supplement, we provide a more extensive (and repetitive) report of the experimental procedures. The idea is that the reader can find the most important information in the main text, and all additional details in the Supplement (or in our preregistrations on the Open Science Framework: https://osf.io/j5dgx ). For example, in the main text, it is mentioned that reinstatement in Experiment 2 consisted of two US presentations in context A, one day before the final test (see p. 6 and Fig. 1C). The Supplement (p. 1) adds that the reinstatement session started with 300 s of acclimation, followed by the first US and 180 s later by the second US, and that the rat was removed from the context 120 s after last US onset. For all phases of Experiment 2, the US was a 0.7-mA, 1-s shock.
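
      As a reading aid, the reinstatement timing described above can be laid out as a short calculation. This is only a sketch: the function and key names are hypothetical, and it assumes that the 180-s interval is measured from the onset of the first US to the onset of the second.

```python
# Sketch of the Experiment 2 reinstatement session timing (seconds from session start).
# Assumption: the 180 s between the two US presentations is measured onset to onset.

def reinstatement_timeline(acclimation_s=300, us_interval_s=180, removal_after_last_us_s=120):
    first_us_onset = acclimation_s                      # first US follows acclimation
    second_us_onset = first_us_onset + us_interval_s    # second US 180 s later
    removal = second_us_onset + removal_after_last_us_s # rat removed 120 s after last US onset
    return {
        "US1_onset": first_us_onset,
        "US2_onset": second_us_onset,
        "removal": removal,
    }


timeline = reinstatement_timeline()
```

      With the values given in the Supplement, the session would last 600 s under this assumption.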

      -In Exp 5, tests 1 and 2 are supposed to have 12 CS each. However, only 8 dots are represented on the graph. Did the authors average some freezing values after the first 4 CS presentations?

      Thank you for noticing this. We did not average freezing values, but just did not measure freezing on all trials, as we were not specifically interested in the concrete freezing levels on each trial, but rather in the overall extinction curve. As mentioned in the legend of Fig. 2, freezing during CS5-7-9-11 was not measured (and hence also not shown). In other words, the 8 dots on the graph represent CS1-2-3-4-6-8-10-12.

      -There is an obvious difference in baseline freezing response before the test in Exp 7 (Figure 5A-B). Discussion of these differences is an important point and was thoroughly discussed by the authors in the Supplement.

      Thank you for pointing this out.

      -Ln 384-387: "... additional Bayesian analyses were carried out that collectively suggested substantial evidence for the absence of an amnestic effect". Despite the "substantial effect" given by the meta-analysis, I am a bit confused by the meaning of an "anecdotal evidence against drug < control" reported in half of the experiments. How do the authors interpret these results?

      In short, Bayesian analyses provide evidence that is categorized starting from ‘no evidence’, to ‘anecdotal’, ‘substantial’, ‘strong’, etc. depending on the obtained Bayes factor. Grouping studies with anecdotal and substantial evidence in a meta-analysis can result in overall substantial evidence, which is what we observed here.

      Addressing this remark in more detail, we want to point out that the use of frequentist analyses (ANOVAs and t-tests) allowed us to conclude that we could not replicate the amnestic effects of previously published studies – we did not obtain a statistically significant amnestic effect although we had sufficient power to detect the effect sizes that had been previously reported. However, those analyses do not permit us to make inferences about the evidence against an amnestic effect. Bayesian analyses, on the other hand, do allow us to quantify the obtained evidence against an amnestic effect (i.e., the null hypothesis) for each single experiment or by combining the results of several studies. When a single study suggests only anecdotal evidence against an amnestic effect, this implies that we cannot conclude based on that study alone that we have proper evidence for the absence of an effect. Rather, we can only conclude that we have no evidence for the presence of an amnestic effect and weak (‘anecdotal’) evidence for its absence. However, a collective analysis of our studies does lead to the conclusion of substantial evidence for the absence of an amnestic effect overall.
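
      The verbal categories mentioned above can be illustrated with a small helper. This is a sketch under assumptions: the cut-offs (1, 3, 10, 30, 100) are the conventional Jeffreys-type thresholds and are not stated in this response, and the function name and example values are hypothetical. The Bayes factor is expressed as evidence for the null hypothesis (no amnestic effect) over the alternative.

```python
# Sketch: map a Bayes factor in favour of the null (BF01) onto conventional verbal labels.
# The thresholds are the commonly used Jeffreys-type cut-offs, assumed here for illustration.

def evidence_for_null(bf01):
    if bf01 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf01 < 1:
        return "favours the alternative"
    if bf01 == 1:
        return "no evidence"
    if bf01 < 3:
        return "anecdotal"
    if bf01 < 10:
        return "substantial"
    if bf01 < 30:
        return "strong"
    if bf01 < 100:
        return "very strong"
    return "extreme"


# Hypothetical Bayes factors, not values from the paper.
labels = [evidence_for_null(bf) for bf in (1.8, 4.2, 12.0)]
```

      A single experiment with a Bayes factor between 1 and 3 would thus be labelled anecdotal, while a pooled analysis can reach the substantial range even if several individual experiments do not.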

      -The effect of cycloheximide on memory consolidation is indeed unexpected. Even if beyond the scope of the current study, what is the authors' hypothesis to explain that cycloheximide in their conditions induced a pro-mnesic effect on the consolidation of fear memories but altered the consolidation of extinction?

      As indicated by the reviewer, this is beyond the scope of the current study. We have no additional data on this effect and can only guess at its meaning. Also note that the effect was rather small and disappeared quickly during the test under extinction.

      One purely speculative hypothesis is that the injection with cycloheximide was more arousing than the vehicle injection, either due to sensations caused by the substance during injection or due to the rapidly emerging malaise it induced (or a combination of both), which we have documented in the Supplement (p. 5).

      In line with work by McGaugh, Roozendaal and colleagues, such arousal around the time of training could, in theory, enhance consolidation of a fearful memory, and thus explain greater fear memory during test (see e.g., Roozendaal & McGaugh (2011), https://doi.org/10.1037/a0026187 ). Then again, a similar argument could be made for improved consolidation of the extinction memory (de Quervain et al. (2019), https://doi.org/10.1007/s00213-018-5116-0 ), which we did not observe. One could suggest that – assuming that we have observed ‘true’ effects here – the arousal component had the upper hand during the consolidation of the fear memory, while the protein synthesis inhibition overruled such effects during consolidation of the extinction memory. As this is all highly speculative, we prefer to not add this to the Discussion.

      -Cycloheximide seemed to induce post-reconsolidation amnesia of fear memory after extinction training (Exp 8, Fig 3G) but not after single CS reactivation. Can the authors please develop this point? Is it possible that several presentations of the CS are required to destabilise the initial memory trace?

      First of all, it is important to emphasize that cycloheximide-treated rats in Experiment 8 (Fig. 3G) froze more during the CSs of Test 2 than control animals, arguing against a drug-induced reconsolidation blockade of the initial fear memory. Furthermore, the obvious within-session extinction during Test 1 in Experiment 8 suggests that it did not function as a typical reactivation-without-extinction session (Merlo et al. (2014), https://doi.org/10.1523/JNEUROSCI.4001-13.2014 ).

      In light of the current literature, reactivation with a single CS is by far the most common way to destabilize a memory trace that was formed with one or three CS-US pairings. As mentioned in our paper, this should provide an appropriate degree of prediction error for the memory to become malleable (p. 12).

      Theoretically, it is indeed possible that more than one (e.g., two) CS presentations could allow for destabilization of the memory trace, although others who have used reactivation sessions with more than one CS presentation did not find the amnestic effects that they did observe with a single CS (Merlo et al. (2014); Sevenster et al. (2014), https://doi.org/10.1101/lm.035493.114 ).

      Reviewer #3:

      Luyten et al.'s study examines the phenomenon of drug-induced post-retrieval amnesia for auditory fear memories in rats, and reports that, after several experiments using Propranolol, Rapamycin, Anisomycin or Cycloheximide, they essentially observe no disruption of reconsolidation (i.e., no amnesia). This is a well-executed, well-written and meticulous study examining an important phenomenon. The authors' failure to observe amnesia using these "reconsolidation blockers" highlights an important fact: systemic administration of these drugs at the time of memory retrieval may not robustly influence reconsolidation processes, despite what the existing literature may collectively indicate. The authors' data clearly indicate this point and it is important that the scientific community be made aware of these difficulties in blocking reconsolidation using systemic administration of these drugs.

      We are thankful for these generous comments and value the reviewer’s thorough and thoughtful assessment of our work. We also appreciate the reviewer’s position that it is important to get this message across to the scientific community.

      This group has previously published similar studies disputing similar phenomena. First highlighting a lack of amnesia following the reconsolidation-extinction paradigm and then more recently demonstrating a lack of amnesia attempting to block the reconsolidation of context fear memories. This is now their third installment focusing on Cued fear memories. Certainly, these findings are important, but arguably the novelty of such findings may be diminished a bit.

      We appreciate that the reviewer is well aware of some of our other work in this domain that supports a more general and widespread reproducibility crisis in this field.

      Regarding the novelty, one key point to stress here, which is also articulated in the paper (p. 3, 13), is that the current rodent findings (which we could not replicate) are the ones that provide the most direct basis for the clinical translations that have been proposed (e.g., by giving patients a propranolol pill after retrieval of a traumatic or phobic memory, see e.g., https://kindtclinics.com/en/ or Kindt & van Emmerik (2016), https://doi.org/10.1177/2045125316644541 ), and are therefore critical in their own right, not only because of their fundamental scientific relevance, but certainly also in light of their clinical reach.

      In one of the "control" experiments where the experimenters administer anisomycin immediately post training, they observe a paradoxical result - they observe memory strengthening instead of the expected blockade of consolidation and amnesia. This result highlights a number of things to consider when we interpret these overall results. For one, protein synthesis inhibitors (PSIs) are toxic and, when administered systemically, usually induce diarrhea in the animals and generally just make them sick. This of course will make the animals stressed and agitated and likely increase their amygdala activity. All of this could likely be the reason why the animals exhibited memory strengthening or no impairment in consolidation even with a PSI on board. See PMCID: PMC7147976, Figure 6. In this study, they could rescue the impairment of PSI on consolidation by increasing BLA principal neuron firing. Thus, an important takeaway is that something like this could easily be happening in the reconsolidation experiments - that there is no blockade because the animals are stressed, either due to PSI on board or because some issues with experimenter/animal interactions, etc. lead to higher BLA neural activity and rescue of the reconsolidation process.

      We agree that (systemic) protein synthesis inhibitors can induce signs of sickness in the animals (particularly in the first hours after injection) and have provided a detailed description of our relevant observations in the Supplement (p. 4-5). The reviewer is completely correct in stating that this may cause some amygdala activation which could interfere with the amnestic effects that we expected to see, as described in the paper by Shrestha, Ayata et al. (2020), and in line with our reply to Reviewer #2’s first comment regarding our cycloheximide experiment. Yet, effective induction of amnesia with these drugs has repeatedly been reported in the literature.

      Nevertheless, although relevant, the current remark has relatively few implications for our findings. In the large majority of our experiments, we did not use these toxic protein synthesis inhibitors (PSIs) (such as cycloheximide and anisomycin), but drugs that have generally been administered systemically throughout the literature (with successful amnestic effects). Furthermore, in the experiments where we did administer systemic cycloheximide or anisomycin, we observed no differences compared to vehicle-treated rats in contextual freezing (e.g., 9% on average in Experiment 7) immediately prior to the crucial test tones (Test 1, 24h after injection) – which argues against high levels of stress or agitation. Moreover, a blinded experimenter could not tell the difference between PSI-treated versus vehicle-treated animals while handling the animals for the test session, and observed no behavioral abnormalities, nor signs of pain or distress, as mentioned in the Supplement. We acknowledge that these experimenter observations may not entirely reflect what is happening in the animals’ amygdala, but they at least go against the notion that PSI-treated animals would be too sick to be tested properly.

      I don't think the authors go far enough articulating the important differences between systemic and intra-cranial administration of these drugs. Time is a potential factor. Immediate administration of the drug at high concentration in the target brain region (BLA) versus many minutes until the drug gets to the target region with uncertain concentration levels that may not mirror levels reached with intracranial administration. It's unfortunate the authors were not able to include intra-BLA administration of these drugs in this study. I do not necessarily expect them to do such experiments, since they have already done so much and it is not clear the laboratory has the appropriate expertise to conduct such experiments, but this comparison would be helpful.

      We fully agree that our results do not provide any information about the replicability of intracranial administration of drugs to induce post-retrieval amnesia of cued fear memories. We had already clearly acknowledged this in the first version of the paper (p. 11), but have now added an extra section to the Discussion (p. 13) to highlight this point in the new version posted on BioRxiv (Version 2). Notwithstanding the expertise of our laboratory to carry out intracranial infusions, we agree with the reviewer that such experiments are beyond the scope of this article.

      It is, however, noteworthy that the drugs that we used in 6 experiments did not necessarily rely on intracranial administration in prior successful studies. Rapamycin, for example, has generally been used systemically (not intracranially). Propranolol has been used either systemically or intracranially in rodents and always systemically in human subjects (healthy and patients). Bearing in mind the timing issue that was raised by the reviewer, we moreover included an experiment with pre-reactivation administration of propranolol (Experiment 4), where the drug was injected 5-8 minutes before the rats heard the reactivation tone.

      I think it is important that the authors make some statement of training conditions on cannulated versus non-cannulated rats. For example, every animal in Nader's 2000 study was bilaterally cannulated targeting the BLA. In contrast, every animal in this study underwent no such surgery. I think this is relevant. In my experience non-cannulated animals are a bit smarter than cannulated animals and the training conditions across these two differing groups may not equate to the same level of learning. And of course, differences in learning levels can lead to differences in the ability of the retrieved memory to destabilize.

      Thank you for pointing this out. We are aware that there may be differences between operated and non-operated animals and already briefly discussed this matter in the Supplement (p. 4). We have now also added this issue to the Discussion in the new section (p. 13) where we emphasize the differences between systemic and intracranial drug administration in relation to the previous comment.

      That being said, the comment regarding (non-)cannulated rats only really applies to Experiment 7 where we tested the effects of systemic anisomycin or cycloheximide. Prior cued fear conditioning studies indeed used intracranial administration of these drugs. The argument does not hold for Experiments 1-6, as systemic propranolol and rapamycin have repeatedly been reported to have amnestic effects in non-operated rats, with procedures identical to or closely resembling ours.

      The authors mention possibly examining markers of memory destabilization. GluR1 phosphorylation, GluR2 surface levels, protein degradation/ubiquitination have all been used to assess if destabilization has occurred. I do not fully agree with their reasons for not performing such experiments. They could examine some or one of these phenomena across differing training conditions between retrieval and no-retrieval animals. This likely could be informative. However, the authors may not possess the necessary expertise to conduct such experiments, so I'm not stating these experiments need to be completed, but certainly the study could be strengthened with such data.

      We agree that including yet more control experiments, using different experimental approaches could further strengthen the study. Nevertheless, the main conclusion of our paper – i.e., reconsolidation blockade using systemic administration of several drugs is considerably more difficult to reproduce than what the literature collectively indicates – is strongly and sufficiently supported by the data that we already report here. Overall, we believe that our conclusion does not require such additional controls. Moreover, even though the comparisons suggested by the reviewer could indeed be scientifically interesting, it is still unclear whether such experiments would provide sufficiently clear cut-offs as to which experimental condition would then allow for adequate memory destabilization and interference.

      Experiment 3E - Propranolol without reactivation. I don't see any data for this on the graphs. Am I missing something?

      Our apologies for the confusion. The legend shown next to Fig. 1F applies to all panels of Fig. 1, but only Experiment 1 (shown in Fig. 1A-B) contained a no-reactivation group as an additional control. Experiment 3 (shown in Fig. 1E-F) did not. We have moved the legend to the bottom of Fig. 1 to clarify this.

      The authors should probably cite this paper too, PMID: 21688892. The authors in this study find no evidence that propranolol inhibits cued fear memory reconsolidation.

      Thank you for bringing this to our attention. We were aware of this paper, but it had slipped through the cracks. We have cited it in the new version of the paper (p. 11).

    1. Author Response

      We thank the editors for considering our manuscript for publication in eLife and the reviewers for their work. However, we would like to discuss several of their comments.

      The key issue seems to be a lack of novelty of our work, which is not correct in our opinion.

      We would like to quickly reiterate why we think that our findings are novel and have very broad implications.

      The importance of polygenic adaptation is becoming increasingly clear. Unfortunately, it is widely assumed that polygenic adaptation is very difficult, if not impossible, to study in natural populations, because the associated allele frequency shifts are too small to be experimentally characterized (Pritchard et al., 2010). Hence, typically the collective response of many loci is considered, which frequently yields incorrect results due to population stratification (Berg et al., 2019; Sohail et al., 2019).

      Therefore, we have used experimental evolution to characterize polygenic adaptation. Experimental evolution is widely recognized as a powerful tool because of the possibility to replicate experiments. Here, we expand the power of experimental evolution by a hitherto unrecognized aspect: the impact of linkage disequilibrium - we demonstrate that two founder populations with different levels of linkage disequilibrium (LD) result in entirely different selection responses. The consequence of different LD structures is shown by our observation that the same population (i.e. identical LD structure) evolving in two different environments shows the same selection response, but a different population with different LD structure in the same environment shows different selection responses.

      This result has important implications for all studies of polygenic adaptation in natural populations because LD is not accounted for in such studies, yet, as in our study, haplotype blocks with multiple loci could result in a strongly selected allele. Hence, LD will determine the likelihood of this occurring. Furthermore, accounting for linkage provides the opportunity to study polygenic adaptation also in natural populations - a substantial change to the current testing paradigms.

      The second key result of our study is that we demonstrate that selection in hot and cold environments does not fit the simple model of polygenic adaptation, where the same set of loci is responding in different directions when opposing selection regimes are applied. As pointed out by reviewer #2, this is particularly important as it shows that current models of polygenic adaptation are not well-suited to understand adaptation imposed by contrasting ecological factors. We show that there is almost no overlap between the haplotype blocks selected in the hot and cold environment. Most importantly, this is not a matter of power as we show that the blocks responding in one selection regime are not changing their frequency in the opposite direction in the other selection regime. We anticipate that this insight will have a profound impact on theoretical models of polygenic adaptation. Furthermore, as we studied temperature adaptation, our results will also have important consequences for the battery of ongoing studies aiming to link selection signatures to response to climate change.

      In brief, we think that very minor clarifications in our manuscript can solve the technical issues identified by the reviewers and will provide a clearer picture about the general implications of our findings.

      A detailed response to the comments of the reviewers is given below.

      Reviewer #1:

      Otte et al. used an evolve and re-sequence strategy to explore "the genetic architecture of adaptive phenotypes". The authors previously found different genetic architectures across different founder populations evolving in a common hot environment. The authors chose one of these founder populations for replicated experimental evolution (5 replicate populations) in a cold environment for 50 generations. The authors were surprised to discover the same number of loci evolve under strong selection between the hot-evolved and cold-evolved replicate populations, though the 20-ish loci are largely non-overlapping. The distribution of selection coefficients was also similar. They interpret this commonality as evidence that the founder population history has a larger effect on adaptive architecture than the selection regime.

      The study demonstrates a comprehensive effort to discover the number of genome regions and distribution of selection coefficients that emerge from a highly controlled experimental evolution project. The experienced team applies a sophisticated toolkit to this powerful experimental design - a toolkit that grows ever more sophisticated with each new experimental run that they perform. However, the authors set me up to learn why such different adaptive architectures emerge from different founder populations. Ultimately, the researchers acknowledge that they "cannot pinpoint the cause for the differences in the inferred adaptive architecture..."

      Here, the reviewer correctly identified one of the main new questions that arose from the new experiment we performed in this study. In a large part of the discussion and the associated analyses, we provide answers to this question, i.e. possible alternative explanations for the different architectures observed in the Portugal vs. the Florida population. We indeed cannot pinpoint "the" cause for the differences, which the reviewer seems to request here as a definite answer, but we favour one explanation that has not previously been discussed in the literature (LD).

      Some results simply recapitulated the previous Portugal E&R study and other results recapitulated a D. melanogaster E&R study.

      This statement about "some results" is ignoring the main new experiment of this study, which is the Portugal population evolving in a cold temperature. For this, we carried out a new selection experiment in a new environment, which finds different selection targets than the previously published experiments. This new experiment therefore does not recapitulate the previous results. We then compare this new experiment to a previous one, and this comparison raises a set of new questions that we address in this manuscript. Only for the purpose of making that comparison, we indeed "simply recapitulated" "some results" of the previous study. The statement is therefore misleading in the way it is put here. Furthermore, the D. melanogaster study is also not recapitulated: in that study, it was not possible to identify selected haplotypes. The D. melanogaster study was therefore unable to determine how many selection targets were shared between the hot and cold selection regimes. The identification of selected haplotypes was a major improvement in this study, which made it possible only now to determine how many targets are shared and to evaluate whether selection targets behave as predicted by the trait optimum model.

      I did not find the "common adaptive architecture" across different selection regimes to be a particularly compelling discovery of sufficiently broad interest.

      This is a very subjective opinion, and it would have been good if the reviewer had explained why this is not an interesting discovery to her/him. We feel that this statement simply reflects that the reviewer does not fully appreciate the complexity of polygenic adaptation. We would like to point out again that this result has important implications for the interpretation of selection signatures in natural populations.

      Other concerns and questions can be found below:

      Major concerns:

      1) Pg. 4: It is my understanding that the power of multiple populations from a single founder evolving in parallel allows for more rigorous identification of loci targeted by selection. I found it surprising to discover that if a lack of replication emerges from an experimental evolution study, this outcome is interpreted as "genetic redundancy." First, genetic redundancy has a precise definition in genetics that muddles the author's meaning. And second this interpretation seems rather post-hoc.

      This statement shows that the reviewer is disregarding the work of Barghi et al. (2019, PLoS Biology) and the definition of redundancy in the context of polygenic adaptation as discussed by Láruson et al. (2020) or Barghi et al. (2020, Nature Reviews Genetics). In any case, this is a semantic issue and should not be considered a major issue with our manuscript.

      2) To "shed more light on the different selection responses" is a weak motivation. The introduction sets me up to understand why selection responses are so different but no major insights into the "why" emerge from the cold-adaptation experiment.

      We modestly disagree - we clearly discuss different explanations of “why” and favor one of them (LD).

      3) More explanation of figure 1 in the main text is needed. Does each point correspond to a SNP that consistently changes across all five populations? Or is this the union?

      The reviewer does not seem to be familiar with the statistical analyses that have been used in our study in the same way as it is common practice in the field. Despite the common use of this test, we still provided a detailed explanation in M&M and explicitly mentioned the test in the figure legend. But this can easily be detailed even further and should not be a major issue with this manuscript.

      4) Line 210: How did the researchers define "stress" and determine that the degree of stress is equivalent across two temperature regimes? The absence of these data undermine the potency of the comparison.

      It is not clear why the reviewer requires a more elaborate definition of temperature stress - the concept of extreme temperatures imposing stress is well established and we cite the relevant literature for Drosophila in the text. Furthermore, it is not apparent why the reviewer requests the degree of stress to be equivalent between the two temperature regimes.

      5) How can the authors be sure that the only difference between the hot and cold populations was temperature? Was competition/population size/etc held constant? Might the lack of overlap between hot and cold adapted loci stem from one such regime selecting for a different phenotype? (i.e., not temperature tolerance)

      As clearly stated in M&M, the culture conditions were the same with the exception of temperature.

      6) Line 237: The authors assert that most alleles show a temperature-specific response - a discovery with precedent in the literature, including from this team of researchers. The authors attribute the absence of common loci between temperature regimes to the high number of generations (50) compared to the number across seasons cited in Bergland et al. The researcher could easily look for common targets at earlier time points of experimental evolution to test this idea.

      This is an interesting suggestion, but the reviewer fails to explain why the analysis of early generations should be more informative than the analysis of later generations. Several studies have already documented the opposite.

      7) Line 292-293: This section reads as disingenuous - the researchers could have explored overlap between Portugal and Florida founders using only the selected loci coordinates and look for non-random overlap using simulations/resampling tests.

      The reviewer seems to assume that we could easily apply the same test for overlap that we used for the hot vs. cold comparison within the Portugal population to the Portugal hot vs. Florida hot comparison. But this is not feasible, and we clearly explain why the comparison of selected haplotype blocks between different founder populations is not helpful (low LD results in different haplotype blocks - even with the same target)
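
      The kind of resampling test the reviewer describes can be sketched as follows. This is a minimal illustration in Python, not the analysis used in the manuscript: the block coordinates, the chromosome length and all function names are hypothetical, and blocks are placed uniformly at random on a single linear chromosome, which ignores recombination and LD structure. As the response notes, such a test presupposes that block coordinates are comparable between founder populations.

```python
import random

def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a position."""
    return a[0] < b[1] and b[0] < a[1]

def count_overlapping(blocks_a, blocks_b):
    """Number of blocks in blocks_a that overlap at least one block in blocks_b."""
    return sum(any(overlaps(a, b) for b in blocks_b) for a in blocks_a)

def shuffle_blocks(blocks, chrom_length, rng):
    """Place each block at a uniformly random position, keeping its length.
    Shuffled blocks may overlap each other (a simplifying assumption)."""
    shuffled = []
    for start, end in blocks:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled

def overlap_p_value(blocks_a, blocks_b, chrom_length, n_resamples=1000, seed=1):
    """One-sided empirical p-value for observing at least this much overlap by chance."""
    rng = random.Random(seed)
    observed = count_overlapping(blocks_a, blocks_b)
    at_least = 0
    for _ in range(n_resamples):
        resampled = shuffle_blocks(blocks_a, chrom_length, rng)
        if count_overlapping(resampled, blocks_b) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_resamples + 1)

# Hypothetical block coordinates on a 10 kb toy chromosome.
hot_blocks = [(100, 200), (5000, 5200)]
cold_blocks = [(150, 250), (9000, 9100)]
observed, p = overlap_p_value(hot_blocks, cold_blocks, chrom_length=10_000, n_resamples=200)
```

      Keeping block lengths fixed while randomizing positions is one common choice of null; other nulls (for example, conditioning on local recombination rate) would give different p-values.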

      8) Discussion: The speculation about why such different architectures emerged across Portugal and Florida was diluted by the absence of initial fitness estimation upon subjection to a cold environment (which would have offered evidence for different initial "optima" across founder populations) as well as the change in fitness from generation 0 to generation 50.

      It is not apparent why the reviewer requests a fitness estimate at the cold environment. Our analysis only included a single population in the cold environment. Hence, the only informative comparison is the one in the hot environment which has been done for both populations and is referenced in the manuscript.

      9) The simulations and corresponding discussion would make for an interesting review/opinion piece but not as new results for this manuscript.

      Unlike the reviewer, we think that a good discussion puts the results into perspective with different hypotheses on how to explain it and link this to the current literature.

      Minor Comments:

      1) Pg. 3. The recurrent citation of Barghi et al. in the Introduction undermined the reader's impression that fundamental questions are being addressed in this article.

      Maybe it escaped the reviewer’s attention that we cited three different Barghi et al. papers and only one reports experimental data (cited only once), while the others are required to describe the theoretical framework, including the concept of "redundancy" which the reviewer misunderstood. New fundamental questions in this current manuscript are addressed using the Portugal population, which was selected in a cold temperature regime (not hot-evolved Florida, which was the topic of Barghi et al. 2019).

      2) Lines 33-39: The argument that parallel signatures of selection across distinct natural populations are insufficient to address the polygenic basis of adaptive phenotypes, and so comparatively more contrived E&R studies are required, was unconvincing.

      Unfortunately, the reviewer does not provide support for this strong statement. In fact, we find the statement of “contrived E&R studies” not as objective as we would have liked to see in a scientific discourse.

      3) Line 158: Confusing. Should "among" actually be "within"?

      The reviewer is not right - the correct wording is "among" not within: multiple different haplotypes can carry the actual target of selection, and they can differ by additional variants which themselves are not selected for. Multiple haplotypes with the selection target are also experiencing more pronounced frequency changes than expected under neutrality. The correlation of their allele frequency trajectories depends, however, on the extent that hitchhiking SNPs are shared among these haplotypes. To account for this, we used a less stringent correlation cutoff.

      4) Line 486: I believe that the authors would be hard-pressed to find in the literature a paper declaring that "single population...[is] sufficient to understand the genetic basis of adaptive traits".

      In fact, many selection tests are targeting only a single population and most studies only apply them to a single population.

      Reviewer #2:

      This reviewer mainly asks us to discuss some of his/her ideas - this can be done, but since reviewer #1 already felt that there is too much discussion in our manuscript, this is a bit of a mixed message.

      Overall Review: This is another commendable study from the Schloterer lab that features next generation genome-wide sequencing of multiple evolving populations. It compares results obtained with two different selection regimes, one hot and one cold, and two different founding populations of Drosophila simulans, one from Portugal and one from Florida. The results reveal a lack of consistency among selection regimes and founding populations. Temperature-dependent adaptation is shown to be "local" or "contingent," rather than globally consistent. My chief recommendations concern the experimental and theoretical contexts within which this study should be interpreted.

      Major points:

      1) I do not require any additional data collection or statistical revision. My comments are organized in terms of experimental paradigm (A) and theoretical significance (B).

      A.

      2) The typical paradigm for experimental evolution in this and many other labs is the use of hybrid populations created from isofemale lines. This method for founding experimental populations can be expected to generate some degree of random "historicity" as the isofemale lines approach fixation of specific genotypes with high stochasticity. Then there are further stochastic and historical effects which arise when such lines are hybridized. The strengths and limitations of this paradigm should be addressed. Most importantly, such stochastic historical effects might be the source of the discrepancy between the replicate lines derived from Portugal and Florida.

      We would like to emphasize that we were using freshly established isofemale lines kept in the laboratory for at most 10 generations, as stated in the M&M section.

      3) As the authors themselves point out, there is a comparative difficulty arising from the different scales of replication used for the Florida versus Portugal experiments.

      The reviewer is correct, and since we were aware of this, we performed statistical tests to account for this.

      A further question for large-scale experimentation is whether a larger and uniform level of replication might produce more similar results, such as 20 evolving populations from each source. Or indeed, three sets of ten evolving populations from three distinct founders from the two sources, with a total of 60 evolving experimental lineages. The authors should discuss whether they believe that their findings would hold up with such an expanded experimental protocol.

      This is an interesting thought of its own, but we feel that it does not contribute much to our current study.

      4) The authors themselves point out at one point that their experiments might have benefitted from some phenotypic characterization of the presumed temperature adaptation. That raises the more general question of how the field of experimental evolution can progress with some labs just doing phenotypes and other labs just doing genome-wide sequencing. Surely this and other studies would be strengthened by combining the two types of assay. Furthermore, genomic evolution might be usefully analyzed in terms of the degree to which specific genomic changes can be associated with specific phenotypic changes, as that is the foundation for adaptation itself.

      We would like to draw attention to the fact that we performed a laboratory natural selection experiment, for which the environmental factor is known, but not the phenotype actually selected - hence the phenotyping is not as trivial as implied by the reviewer.

      B.

      5) This is yet another study that finds difficulties with the invocation of noroptimal selection along a one-dimensional functional gradient. Such models have been long-standing favorites of evolutionary theorists, such as Kimura and Lande. But that preference may arise more from the ease with which these models can be formulated and analyzed by theoreticians. Actual evolving populations don't seem to embody the precepts of such theory, whether the issue is the maintenance of genetic variation (see the work of Turelli, for example) or the evolution of closely studied populations, as illustrated by this study. An alternative point of view that the authors should discuss is that such models are indeed NOT usually correct.

      It is very interesting that this reviewer feels that our data demonstrate that the prevailing model of polygenic adaptation is wrong, but our manuscript is still considered to be of insufficient novelty.

      6) There are alternative theoretical frameworks that address the maintenance of genetic variation and the response to selection. Among these are schemes of protected polymorphism arising from overdominance, epistasis, and frequency-dependent selection. If the thrust of the preceding point 4 is accepted, then it would be theoretically salient for the authors to suggest what type of underlying population genetic machinery would best account for their findings, in place of the noroptimal selection-mutation balance model.

      We thank the reviewer for these interesting suggestions. However, their predictions are not at all trivial to test. For this reason, generations of population geneticists tried to test them, so we feel that this task is well beyond the scope of this manuscript.

      Reviewer #3:

      In their manuscript 'The adaptive architecture is shaped by population ancestry and not by selection regime,' Otte and colleagues use an evolve and resequence strategy to examine how a Portugal population of D. simulans responds to cold temperature. The authors identify putative targets of selection and compare the number of targets, their location, and the distribution of selection coefficients to previous work on the same population exposed to hot temperatures as well as a different population exposed to hot temperatures. The topic is of general interest, the work is sound and the writing is clear and concise.

      1) It is not clear what the novel contribution of this manuscript is. The title indicates that the key finding is that population of origin mediates response to selection rather than the selection regime. However, the authors fail to provide compelling data to support that. The data are from 1 population under two selection regimes and a second population under one of those regimes. There simply aren't enough comparisons to infer that population ancestry plays a bigger role than selection regime in adaptive evolution.

      We disagree with the reviewer and would like to repeat the logic of our experiment:

      Comparison 1: contrast of different populations in the same environment -> different architecture

      Comparison 2: contrast of the same population in different environments -> same architecture

      With this simple design it is possible to reach the conclusion that the architecture is affected by population history more than by selection regime and no more populations are needed to reach this conclusion. This insight has not been reported before.

      2) The authors also seem to argue that a contribution of this paper is that it illustrates that temperature adaptation is not a single trait. This was the major finding of a 2014 paper from the same group in D. melanogaster- a single founder population was exposed to hot and cold temperatures and the authors found almost no overlap between the putatively selected variants in the two different temperature regimes.

      We would like to point out that the analysis of Tobler et al. (2014) is on the basis of individual SNPs, which is difficult to interpret because of the many segregating inversions in D. melanogaster. All the complications of these data and the implications for the interpretation can be found in the discussion of Tobler et al. (2014). In the current study, we are identifying selected haplotype blocks, which is mandatory to compare the architectures and selection responses.

      3) Beyond the limited impact of the current work, there are some additional specific issues. The authors note that it was 'remarkable' that the distribution of selection coefficients and the number of inferred selection targets between the hot and cold experiments was 'highly similar.' What is the null expectation? Where does the null come from?

      This is a minor semantic issue. Naturally, there is no null model for the number of selection targets, but if two populations selected for the same trait provide different architectures, different selection regimes should be even more likely to generate different architectures.

      4) The discussion is somewhat unsatisfying and largely speculative. The 'different trait optima' section reads as straw man; this could be reframed to better guide the reader.

      Naturally, the discussion intends to put the results in a broader context. It would have been helpful to read how s/he envisions a reframing that would improve the manuscript.

      There is little support for the 'differences in adaptive variation' hypothesis.

      It would have been helpful to read which kind of support the reviewer would have expected beyond the evidence we have already provided.

      The section on LD was interesting, but the simulation findings should reside in the results section.

      This could be easily moved, but we feel that it is well-placed in the discussion as we use the simulations to compensate for the lack of literature on this field (again demonstrating the novelty of our manuscript).

      References:

      Barghi, N., R. Tobler, V. Nolte, A. M. Jakšić, F. Mallard, K. A. Otte, M. Dolezal, T. Taus, R. Kofler, & C. Schlötterer (2019). Genetic redundancy fuels polygenic adaptation in Drosophila. PLOS Biology 17: e3000128.

      Barghi, N., J. Hermisson, & C. Schlötterer (2020). Polygenic adaptation: a unifying framework to understand positive selection. Nature Reviews Genetics.

      Berg, J.J., Harpak, A., Sinnott-Armstrong, N., Joergensen, A.M., Mostafavi, H., Field, Y., Boyle, E.A., Zhang, X., Racimo, F., Pritchard, J.K., et al. (2019). Reduced signal for polygenic adaptation of height in UK Biobank. eLife 8.

      Bergland, A. O., E. L. Behrman, K. R. O’Brien, P. S. Schmidt, & D. A. Petrov (2014). Genomic Evidence of Rapid and Stable Adaptive Oscillations over Seasonal Time Scales in Drosophila. PLoS Genetics 10, e1004775.

      Láruson, Á. J., S. Yeaman, & K. E. Lotterhos (2020). The Importance of Genetic Redundancy in Evolution. Trends in Ecology and Evolution 35: 809–822.

      Pritchard, J.K., Pickrell, J.K., and Coop, G. (2010). The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current Biology 20, R208-215.

      Sohail, M., Maier, R.M., Ganna, A., Bloemendal, A., Martin, A.R., Turchin, M.C., Chiang, C.W., Hirschhorn, J., Daly, M.J., Patterson, N., et al. (2019). Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife 8.

    1. Author Response

      Note from the authors:

      This is the authors' response to the reviewers' comments for the manuscript “Perceptual gating of a brainstem reflex facilitates speech understanding in humans” submitted to eLife via Preprint Review. We appreciate the time and effort the reviewers took to carefully revise our work. We believe all comments and suggestions will improve the manuscript for future publication. All the authors’ comments detailed in this response will be implemented in the next version of this manuscript.

      Reviewer #1: [...] Reviewer 1-Comment 1: 1) An important aspect of assessing the efferent feedback through the CEOAEs and ABRs is to ensure that different stimuli have equal intensity. The authors write in the methodology that the speech stimuli were presented at 75 dB SPL. However, it is not stated if this applies to the speech stimuli only, such that the stimuli that include background noise would have a higher intensity, or to the net stimuli. If the intensity of the speech signals alone had been kept at 75 dB SPL while the background noise had been increased, this would render the net signal louder and influence the MOCR. In addition, it would have been better to determine the loudness of the signals according to frequency weighting of the human auditory system, especially regarding the vocoded speech, to ensure equal loudness. If that was not done, how can the authors control for differences in perceived loudness resulting from the different stimuli?

      Response to Reviewer 1-Comment 1:

      Controlling the stimulus level is a critical step when recording any type of OAE due to the potential activation of the middle ear muscle reflex (MEMR). High intensity sounds delivered to an ear can evoke contractions of both the stapedius and the tensor tympani muscles causing the ossicular chain to stiffen and the impedance of middle ear sound transmission to increase (Murata et al.,1986; Liberman & Guinan,1998). As a result, retrograde middle ear transmission of OAE magnitude can be reduced due to MEMR and not MOCR activation (Lee et al., 2006). For this reason, we were particularly careful to determine the presentation level of our stimuli.

      As pointed out by the reviewer and stated in the Methods section: Experimental Protocol: “The speech tokens were presented at 75 dB SPL and the click stimulus at 75 dB p-p, therefore no MEMR contribution was expected given a minimum of 10 dB difference between MEMR thresholds and stimulus levels (ANSI S3.6-1996 standards for the conversion of dB SPL to dB HL)”. 75 dB SPL was indeed selected as the presentation level for all natural, noise vocoded and speech-in-noise tokens. All tokens were root-mean-square normalized and the calibration system (sound level meter (B&K G4) and microphone IEC 60711 Ear Simulator RA 0045 563 (BS EN 60645-3:2007), (see CEOAEs acquisition and analysis section)) was set to “A-Weighting” which matches the human auditory range. Therefore, the net signal was never above 75 dBA. We acknowledge the lack of details about the calibration procedure in the current manuscript and will consequently add them in a future Methods section.
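
      Root-mean-square normalization, as mentioned above, scales every token so that all tokens carry the same RMS amplitude before calibration. A minimal sketch in Python, assuming tokens are plain lists of samples: the target RMS value and the function names are illustrative, and the mapping from digital RMS to dB SPL or dBA depends on the calibration chain (sound level meter, ear simulator, frequency weighting), which is not modelled here.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def rms_normalize(samples, target_rms=0.05):
    """Scale samples so their RMS equals target_rms (hypothetical digital reference level)."""
    current = rms(samples)
    if current == 0:
        raise ValueError("cannot normalize a silent signal")
    gain = target_rms / current
    return [x * gain for x in samples]

def level_difference_db(a, b):
    """Level of signal a relative to signal b in dB (20 * log10 of the RMS ratio)."""
    return 20 * math.log10(rms(a) / rms(b))

# Two toy tokens with different raw levels end up at the same RMS.
speech_only = rms_normalize([0.10, -0.20, 0.30, -0.10])
speech_in_noise = rms_normalize([0.40, -0.50, 0.60, -0.30])
difference = level_difference_db(speech_in_noise, speech_only)  # close to 0 dB
```

      Normalizing the mixed signal (speech plus masker) rather than the speech alone is what keeps the net level constant across conditions, which is the point the reviewer raises.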

      Reviewer 1-Comment 2: 2) Many of the p-values that show statistical significance are actually near the threshold of 0.05 (such as in the paragraph lines 147-181). This is particularly concerning due to the large number of statistical tests that were carried out. The authors state in the Methods section that they used the Bonferroni correction to account for multiple comparisons. This is in principle adequate, but the authors do not detail what number of multiple comparisons they used for the correction for each of the tests. This should be spelled out, so that the correction for multiple comparisons can be properly verified.

      Response to Reviewer 1-Comment 2:

      Bonferroni corrections were explicitly chosen as the multiple comparisons adjustment across our post-hoc statistical analyses because they are highly conservative and protect against Type I error. All the p-values reported in our study are corrected p-values for post-hoc comparisons. However, we agree that, for verification purposes, the number of comparisons for each statistical analysis should be clarified in the Methods section, and this will be added to a future version of the manuscript.
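
      For reference, a Bonferroni adjustment multiplies each raw p-value by the number of comparisons m in the family (capped at 1), which is why m has to be reported for the correction to be verifiable. A minimal sketch in Python follows; the p-values are made up for illustration and are not taken from the manuscript.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: min(1, p * m) for a family of m comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw post-hoc p-values for a family of three comparisons.
raw = [0.004, 0.020, 0.300]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
```

      In this toy family, the second comparison would pass an uncorrected 0.05 threshold but no longer does once it is adjusted for three comparisons, which illustrates why p-values close to 0.05 can only be assessed when m is stated.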

      Reviewer 1-Comment 3: 3) Line 184-203: It is not clear what speech material is being discussed. Is it the noise vocoded speech, the speech in either type of background noise, or these data taken together?

      Response to Reviewer 1-Comment 3:

      Lines 184-203 correspond to “Auditory brainstem activity reflects changes in cochlear gain” in the Results section. Line 186 describes changes in ABR components during noise-vocoded speech: “Click-evoked ABRs—measured during simultaneous presentation of vocoded speech—showed task-engagement-specific effects similar to the effects observed for CEOAE measurements.” The subsequent 3 sentences refer to the same (noise-vocoded) condition, whereas the remaining sentences in the section refer to the speech-in-noise conditions. As pointed out by the reviewer we did not specify a specific masked condition in the sentence: “Conversely, although wave III was unchanged in both masked conditions for active vs. passive listening, wave V was significantly enhanced: [F (1, 26) = 5.67, p = 0.025 and F (1, 25) = 8.91, p = 0.006] when a lexical decision was required.” Here the rANOVAs correspond to masked conditions: speech in babble noise and speech-shaped noise respectively. This will be rectified in a future version of the manuscript.

      Reviewer 1-Comment 4: 4) Line 202-203: The authors write that "the ABR data suggest different brain mechanisms are tapped across the different speech manipulations in order to maintain iso-performance levels". It is not clear what evidence supports this conclusion. In particular, from Figure 1D, it appears plausible that the effects seen in the auditory brainstem may be entirely driven by the MOCR effect. To see this, please note that absence of statistical significance does not imply that there is no effect. In particular, although some differences between active and passive listening conditions are non-significant, this may be due to noise, which may mask significant effects. Importantly, where there are significant differences between the active and the passive scenario, they are in the same direction for the different measures (CEOAEs, Wave III, Wave V). Of course, that does not mean that nothing else might happen at the brainstem level, but the evidence for this is lacking.

      Response to Reviewer 1-Comment 4:

      Lines 202-203 also correspond to “Auditory brainstem activity reflects changes in cochlear gain” in the Results section. As suggested by the reviewer, the effects observed in the ABRs may be driven by the MOCR. We agree with this observation in lines 195-197, explaining that the decreased magnitude of ABR components is consistent with reduced magnitude of CEOAEs measured during active listening in the vocoded condition, since a reduction in cochlear gain can reduce the activity of auditory nerve (AN) afferents synapsing in the cochlear nucleus (CN). However, we did not explain that this trend is also observed during the passive listening of speech-in-noise, therefore demonstrating that vocoded and speech-in-noise are differently processed at the level of the brainstem and midbrain. In a future version of the manuscript, we will restrict our interpretation to statistical comparisons in the Results and leave potential mechanisms for the Discussion section.

      Reviewer 1-Comment 5: 5) The way the output from the computational model is analyzed appears to bias the results towards the author's preferred conclusion. In particular, the authors use the correlation between the simulated neural output for a degraded speech signal, say speech in noise, and the neural output to the speech signal in quiet with the efferent feedback activated. They then compute how this correlation changes when the degraded speech signal is processed by the computational model with or without efferent feedback. However, the way the correlation is computed clearly biases the results to favor processing by a model with efferent feedback.

      The result that the noise-vocoded speech has a higher correlation when processed with the efferent feedback on is therefore entirely expected, and not a revelation of the computational model. More surprising is the observation that, for speech in noise, the correlation value is larger without the efferent feedback. This could be due to the scaling of loudness of the acoustic input (see point 1), but more detail is needed to pin this down. In summary, the computational model unfortunately does not allow for a meaningful conclusion.

      Response to Reviewer 1-Comment 5:

      Claims of bias would be understandable had we used shuffled auto-correlograms (SACs) to compare the expression of temporal fine structure (TFS) cues for natural speech versus vocoded stimuli, because TFS cues reconstructed from the envelope of our vocoded stimuli would have differed dramatically from the original TFS cues in natural speech (Shamma and Lorenzi, 2013). However, there is no inherent reason for SAC analysis of envelope cues to be biased towards either the vocoded or the speech-in-noise conditions, as both stimuli retain the original envelope cues from natural speech. Indeed, since the purpose of our simulations was to compare the relative effects of adding efferent feedback on the reconstruction of the stimulus’ envelope cues in the AN for the two degraded stimuli, SACs offered a targeted analysis tool to extract the relevant information with fewer intermediate steps and presumptions than either encoder models or automatic speech recognition systems.

      We do agree with the reviewer that results of our simulations for the vocoded condition may have been less unexpected than those of speech-in-noise, as the envelopes of vocoded stimuli closely resemble those of natural speech in the absence of a masking noise. However, our results also demonstrate that adding efferent feedback could generate negative correlation changes for a number of vocoded words: either at individual frequencies (low and high spontaneous rate AN fibres (see raw data)) or on average across all frequencies tested [high spontaneous rate AN fibres only (Fig Supplement 3)]. This suggests that noise-vocoding speech (i.e. implementing the envelope from broader channel bandwidths while also scrambling spectrotemporal information in said channels) can disrupt envelope representation in the 1-2kHz range of certain words enough that efferent feedback should not be automatically presumed able to rectify their envelope cue reconstruction in AN fibres.

      As for the speech-in-noise conditions, our intuition for the negative correlation changes observed is that the signal-to-noise ratios (SNRs) tested were not large enough to allow for the isolated extraction of the target signal’s envelope by expanding the dynamic range of AN fibres. As the test stimuli and their SNRs were directly acquired by finding iso-performance in the psychophysical portion of this study (and appropriately normalized as input for the MAP_BS model), we consider the results of the simulation to be indicative of the actual benefit/disadvantage that activating efferent feedback might have on envelope representation of vocoded or speech-in-noise tasks in the AN [and not artefacts of poorly calibrated stimulus presentation level (see Responses to Reviewer 1-Comment 1 and 6 for more details about methodology)]. Although this result may be surprising when viewed in the context of physiological and modelling studies demonstrating efferent feedback’s masking effect, our results may help to explain why MOCR anti-masking appears SNR- and stimulus-specific in numerous human studies (de Boer et al., 2012; Mertes et al., 2019).

      Reviewer 1-Comment 6: 6) The experiment on the ERPs in relation to the speech onsets is not properly controlled. In particular, the different acoustics of the considered speech signals -- speech in quiet, vocoded speech, speech in background noise -- will cause differences in excitation within the cochlea which will then affect every subsequent processing stage, from the brainstem and on to the cortex, thereby leading to different ERPs. As an example, babble noise allows for 'dip listening', while with its flat envelope speech-shaped noise does not. Analyzing differences in the ERPs with the goal of relating these to something different than the purely acoustic differences, such as to attention, would require these acoustic differences to be controlled, which is not the case in the current results.

      Response to Reviewer 1-Comment 6:

      Our fundamental methodological strategy was not to compare or even control the acoustics of the signals (although we did this to some extent by normalizing the presentation level and long-term spectrum across all signals), but instead to maintain iso-performance across conditions and, in doing so, allow the identification of brain mechanisms underlying performance in a lexical decision task where speech intelligibility was manipulated.

      We do acknowledge the reviewer’s comment regarding acoustic differences across our speech signals. This is why in the Results section we describe that: “Early auditory cortical responses (P1 and N1) are largely driven by acoustic features of the stimulus (Getzmann et al., 2015; Grunwald et al., 2003)”. Therefore, our ERP analysis instead focuses on later, less stimulus-driven components such as P2, N400 and LPC: “Later ERP components, such as P2, N400 and the Late Positivity Complex (LPC), have been linked to speech- and task-specific, top-down (context-dependent) processes (Getzmann et al., 2015; Potts, 2004).”

      With regard to the reviewer’s example: “…babble noise allows for 'dip listening', while with its flat envelope speech-shaped noise does not”. We would argue that in our specific listening conditions “dip listening” did not offer a perceptual advantage over speech in speech-shaped noise because:

      1) A higher SNR was required in the babble noise conditions to achieve the same level of performance as in the speech-shaped noise manipulations.

      2) When listening to monosyllabic words (as used in our study), listeners have fewer chances to use spectral and temporal dips than when listening to sentences (Rosen 2013).

      3) The dips in the signal are expected to decrease both in depth and frequency with the number of talkers in a babble noise masker (8-talker babble used in our study), with no differences in masking effectiveness for more than 4-talker babble noise (Rosen et al., 2012).

      Overall, we believe that modulated maskers effectively impaired speech intelligibility (Kwon and Turner 2001), but the most effective one was babble noise, confirming that speech is its own best masker (Miller, 1947).

      Reviewer #2: [...] Reviewer 2-Comment 1: 1) A core premise of the experiment is that the non-invasive measures recorded in response to click sounds in one ear provide a direct measure of top-down modulation of responses to the speech sounds presented to the opposite ear. This is not acknowledged anywhere in the paper, and is simply not justifiable. The click and speech stimuli in the different ears will activate different frequency ranges and neural sources in the auditory pathway, as will the various noises added to the speech sounds. Furthermore, the click and speech sounds play completely different roles in the task, which makes identical top-down modulation illogical. The situation is further complicated by the fact that the clicks, speech and noise will each elicit MOCR activation in both ipsi- and contralateral ears via different crossed and uncrossed pathways, which implies different MOCR activation in the two ears.

      Response to Reviewer 2-Comment 1:

      We employed broadband clicks across all stimulus manipulations and listening conditions to activate the entire cochlea so that resulting OAEs could be used to measure modulation of cochlear gain by olivocochlear efferents.

      Historically, studies have applied clicks in one ear (to evoke OAEs) and a broadband noise suppressor in the other to monitor contralateral MOCR activation, demonstrating that clicks are suppressed consistently when subjects actively perform either auditory (Froehlich et al., 1993; Maison et al., 2001; Garinis et al., 2011) or visual tasks (Puel et al., 1988; Froehlich et al., 1990; Avan & Bonfils 1992; Meric & Collet 1994). Therefore, while we acknowledge that the presence of clicks may have made the task of discriminating vocoded and words-in-noise more difficult, we would have expected to observe suppression of click-evoked OAEs for all stimulus manipulations whether subjects were actively or passively listening to speech stimuli in order to minimize the impact of the irrelevant clicks. In contrast, we observed that contralateral suppression of CEOAEs was both stimulus- and task-dependent. Unlike natural and vocoded speech, active listening to speech-in-noise did not produce significant MOCR activation, while passive listening (equivalent to visual attention) generated an MOCR effect in the opposite direction to the active-listening analogues for all 3 speech manipulations.

      Despite spectrotemporal, level and task-difficulty similarities between noise-vocoded speech and speech-in-noise manipulations, the stimulus-dependence of these results suggests that MOCR activation was controlled in a top-down manner according to the auditory scene presented. We speculated that this arises from improved peripheral processing of specific speech cues during active listening, whereas the opposite effects in passive listening are associated with attenuating auditory inputs to prioritize visual information. In line with this, we observed that introducing efferent feedback to our auditory periphery model differentially affected the auditory nerve output for the 3 most challenging speech manipulations: the resulting enhancement or deterioration of envelope cue representation offering an explanation for divergent patterns of MOCR gating for noise-vocoded and speech-in-noise.

      In summary, we predict that observed changes in CEOAE amplitudes in the contralateral ear will mirror cochlear gain inhibition in the ear processing speech. Bilateral descending control of the MOCR despite speech being presented monaurally is not unexpected for two reasons:

      1) Unlike simple pure tone stimuli, speech activates both left and right auditory cortices even when presented unilaterally to either ear (Heggdal et al., 2019)

      2) Cortical gating of the MOCR in humans does not appear restricted to direct ipsilaterally descending processes that impact cortical gain control in the opposite ear, instead likely incorporating polysynaptic, decussating processes to affect cochlear gain in both ears (Khalfa et al., 2001).

      Together this evidence makes it difficult to envisage a case where unilaterally-presented speech does not influence top-down control of cochlear gain bilaterally.

      Reviewer 2-Comment 2: 2) The vocoded conditions were recorded from a different group of participants than the masked speech conditions. Comparing between these two, which forms the essential point in this paper, is therefore highly confounded by inter-individual differences, which we know are substantial for these measures. More generally, the high variability of results in this research field should caution any strong conclusions based on comparing just these two experiments. A more useful approach would have been to perform the exact same task in the two experiments, to examine the reproducibility.

      Response to Reviewer 2-Comment 2:

      We ensured that the two populations tested across the three experiments were all normal hearing adults assessed using the same criteria. They were also age- and gender-matched and were recruited from undergraduate courses at Macquarie University (and therefore presumably possessed similar literacy). However, we acknowledge this as an important issue and controlled for it, as far as we could, by:

      1) Ensuring that CEOAE SNRs were above a 6 dB minimum which allowed for more reliable and replicable recordings within and between subjects (Goodman et al., 2013).

      2) Carefully analysing and selecting ABR waveforms above the residual noise. Residual noise was calculated by applying a weighted average method based on Bayesian inference that weighs individual sweeps proportionally to their estimated precision (Box & Tiao, 1973). This helped preserve all trials without any rejection required for artefacts. ABR waveforms with residual noise equal to or higher than the averaged signal were discarded.

      3) Ensuring that individual ERP components represented a reliable individual average by: a) removing noisy trials (trials between -200 ms and 1.2 sec from sound onset which had absolute amplitude values higher than 75 μV) and b) maintaining between 60-80% of total trials per condition.
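
Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions rather than our processing pipeline: each sweep's precision is assumed here to be the inverse of its sample variance (our estimator is not described in this response), the function names are hypothetical, and each trial passed to the rejection step is assumed to be already cropped to the -200 ms to 1.2 s window.

```python
from statistics import pvariance


def precision_weighted_average(sweeps):
    """Average equal-length sweeps, weighting each by its estimated precision.

    Assumption: precision = 1 / variance of the sweep's samples, so noisy
    sweeps contribute less but no sweep is discarded. A sweep with zero
    variance would need special handling and is not covered here.
    Returns (average_waveform, residual_noise_variance).
    """
    weights = [1.0 / pvariance(sweep) for sweep in sweeps]
    total = sum(weights)
    n_samples = len(sweeps[0])
    average = [
        sum(w * sweep[i] for w, sweep in zip(weights, sweeps)) / total
        for i in range(n_samples)
    ]
    # Variance of an inverse-variance weighted mean.
    return average, 1.0 / total


def reject_noisy_trials(trials, threshold_uv=75.0):
    """Drop trials whose absolute amplitude exceeds the threshold (in microvolts).

    Returns (kept_trials, fraction_kept) so the retained proportion can be
    checked against the 60-80% range mentioned above.
    """
    kept = [trial for trial in trials if max(abs(x) for x in trial) <= threshold_uv]
    return kept, len(kept) / len(trials)
```

The weighted average keeps every sweep, which is why no artefact rejection was needed for the ABRs, whereas the ERP step removes whole trials and therefore requires the retained proportion to be monitored.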

      In addition, we assessed potential differences across common variables between experiments such as lexical performance during natural speech (see Results section), ABR components and CEOAE magnitude changes relative to the baseline during the Active and Passive listening of natural speech (as part of the 1st author’s thesis dissertation: Hernandez Perez, H., & Macquarie University. Department of Linguistics, degree granting institution. (2018). Disentangling the Influence of Attention in the Auditory Efferent System during Speech Processing / Heivet Hernandez Perez): “During active or passive listening of natural speech, no statistical differences between the populations assessed in the noise-vocoded and speech-in-noise experiments for: wave V-III amplitude ratio- Active listening [t (12) = 0.90, p=0.39], Passive listening: [t (23) = 1.58, p=0.13]; wave V-Active listening: [t (23) = 0.09, p=0.93]; Passive listening: [t (24) = -0.24, p=0.81]; CEOAE magnitude changes-Active listening [t (23) = -0.21, p=0.83]; Passive listening [t (24) = -0.36, p=0.72].”

      These results ruled out the possibility that the effects observed across the three experiments were due to intrinsic differences between the populations tested. This will be discussed in a future version of the manuscript and added as supplemental material.
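
As an illustration of the kind of between-population comparison reported above, the sketch below computes an independent-samples t statistic and its degrees of freedom. It assumes the classical pooled-variance (Student) form, which may differ from the exact test used in the thesis. The function name is hypothetical, and no p-value is computed because that would require the t distribution from a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, df) for a two-sample t-test with pooled variance.

    Assumes independent groups with roughly equal variances; df = n_a + n_b - 2,
    which is the number reported in parentheses in notation such as t(23).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df
```

Under this form the degrees of freedom depend only on how many participants contributed usable data to each comparison, which is why they vary across the measures listed.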

      Reviewer 2-Comment 3: 3) The interpretation presented here is essentially incompatible with the anti-masking model for the MOCR that first started off this field of research, in which the noise response is suppressed more than the signal, which is contradictory to the findings and model presented here, which suggest no role for the MOCR in improving speech in noise perception.

      Response to Reviewer 2-Comment 3:

      Physiological evidence for the MOCR anti-masking effect in animal models (Wiederhold, 1970; Winslow & Sachs 1987; Guinan & Gifford 1988; Kawase et al., 1993) has led to the hypothesis that the MOCR may play an important role in aiding humans to perceive speech in noise (Giraud et al., 1997; Liberman & Guinan 1998). The strictly non-invasive nature of human experiments has made measuring MOCR effects on OAE amplitudes the main technique for testing this anti-masking hypothesis. However, OAE inhibition (the MOCR-mediated reduction in OAE amplitude) has been reported as either increased (Giraud et al., 1997; Mishra and Lutman, 2014), reduced (de Boer et al., 2012; Harkrider and Bowers, 2009) or unaffected (Stuart and Butler, 2012; Wagner et al., 2008) in participants with improved speech-in-noise perception. More recently, Mertes et al. (2019) suggested that the SNR used to explore speech-in-noise abilities might explain the contradicting results in the literature. The authors found that the MOCR only contributed to perception at the lowest SNR they tested (-12 dB), suggesting that the role of the MOCR for listening-in-noise may be highly dependent on the SNR, which in turn influences the extent to which the MOCR does or does not provide a benefit for hearing in noise. Therefore, our human and modelling data not only expand but also challenge the classical MOCR anti-masking effect by suggesting that, in humans, this effect is not only SNR-specific (which we controlled) but also task-specific (i.e. whether participants are attending to the contralateral masker or not) and stimulus-dependent (i.e. masker intrinsically noisy vs. signal-in-noise). We acknowledge that we can discuss further how our data advance the current state of the MOCR anti-masking effect in a future version of the manuscript.

      Reviewer 2-Comment 4: 4) The analysis of measures becomes increasingly selective and lacking in detail as the paper progresses: numerous 'outliers' are removed from the ABR recordings, with very uneven numbers of outliers between conditions. ABRs were averaged across conditions with no explicit justification. The statistical analysis of the ABRs is flawed as it does not compare across conditions (vocoded vs masked) but only within each condition separately (active v passive) - from which no across-condition difference can be inferred. The model simulation includes only 3 out of 9 active conditions. For the cortical responses, again only 3 conditions are discussed, with little apparent relevance.

      Response to Reviewer 2-Comment 4:

      In regard to the reviewer’s comment “The analysis of measures becomes increasingly selective and lacking in detail as the paper progresses: numerous 'outliers' are removed from the ABR recordings, with very uneven numbers of outliers between conditions. ABRs were averaged across conditions with no explicit justification.” During the analysis of the ABR measurements, we not only dealt with outliers but also with several missing data points (ABR components below the residual noise). The statistical analysis used to assess potential differences within ABR components was a repeated-measures ANOVA (rANOVA). This type of analysis is particularly restrictive when dealing with missing data points, because it only includes participants for whom data are available in every cell of the design (2 Conditions x 4 Stimulus manipulations for the noise-vocoded experiment). This is why the sample sizes for ABR components appeared uneven across experiments.
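
The effect of this restriction on sample size can be illustrated with a short sketch of complete-case filtering. The data layout and function name are hypothetical, and only "natural" and "Voc8" are stimulus labels taken from our design; the other two labels are placeholders.

```python
from itertools import product

CONDITIONS = ["Active", "Passive"]
# "stim_c" and "stim_d" are placeholder names, not labels from the study.
STIMULI = ["natural", "Voc8", "stim_c", "stim_d"]


def complete_cases(measurements, conditions=CONDITIONS, stimuli=STIMULI):
    """Keep only participants with a value in every Condition x Stimulus cell.

    measurements: dict mapping participant id -> dict mapping
                  (condition, stimulus) -> value, with None or a missing key
                  marking a component that fell below the residual noise.
    """
    cells = list(product(conditions, stimuli))
    return {
        participant: values
        for participant, values in measurements.items()
        if all(values.get(cell) is not None for cell in cells)
    }
```

A single missing cell out of eight is enough to exclude a participant, so different ABR components can end up with different numbers of included participants even when the same people were tested.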

      Regarding the reviewer’s comment: “ABRs were averaged across conditions with no explicit justification.” Our rANOVA had the following design: Factor 1 (Conditions: Active vs. Passive); Factor 2 (Stimuli: natural, 8 channels noise vocoded (Voc8) …etc) and finally the Interaction (Conditions x Stimuli). ABR conditions were not simply averaged together; we only found a significant Conditions effect in the rANOVA, which collapses all stimulus manipulations into Active vs. Passive conditions. Therefore, it was only statistically valid to make inferences and potential interpretations about the Conditions main effect. This will be clarified in both the statistical design and in the Results section of a future version of this manuscript.

      In regard to the reviewer’s comment: “The statistical analysis of the ABRs is flawed as it does not compare across conditions (vocoded vs masked) but only within each condition separately (active v passive) - from which no across-condition difference can be inferred”. Up to this point in our data analysis, we were only interested in within-speech-manipulation comparisons (similar to the CEOAE analysis, i.e. within noise-vocoded manipulations). We agree with the reviewer that a simple comparison between speech manipulations (noise-vocoded vs. masked speech) for the variables that reflect attentional changes (Active vs. Passive listening) could be useful to infer differences across experiments (noise-vocoded vs. speech-in-noise). This analysis will be added in a future version of the paper.

      Finally, regarding the comment: “The model simulation includes only 3 out of 9 active conditions. For the cortical responses, again only 3 conditions are discussed, with little apparent relevance”. At this stage of our analysis, we wanted to understand the potential reasons why the control of the cochlear gain appeared to be dependent on the way speech was being degraded, i.e. noise vocoding the speech signal vs. speech-in-noise. Since iso-performance was achieved at 3 task-difficulty levels, we chose to test how both the biophysical model and the auditory cortex (ERP components) would respond to the most challenging speech degradations (noise vocoded 8 channels, speech in babble noise at +5 dB SNR and speech in speech-shaped noise at +3 dB SNR), where differences in the cochlear gain are most evident across experiments (see Figure 1B in Results section). In these extreme conditions we hypothesized that both the model and the auditory cortex activity would display the most obvious differences in the processing of the different speech degradations. We acknowledge the reviewer’s comment and in a future version of this manuscript, this line of thought will be more clearly described.

      Reviewer 2-Comment 5: 5) The assumption that changes in non-invasive measures, which represent a selective, random, mixed and jumbled by-product of underlying physiological processes, can be linked causally to auditory function, i.e. that changes in these responses necessarily have a definable and directional functional correlate in perception, is very tenuous and needs to be treated with much more caution.

      Response to Reviewer 2-Comment 5:

      We acknowledge the reviewer’s view about being cautious when interpreting non-invasive measures associated with human perception. However, the physiological measurements used in this study are not new in the field of auditory or speech perception; they are gold-standard methods to assess auditory function in both animal and human models. The novelty of our approach lies in imposing attentional states (Active listening and Passive listening) while concurrently probing along the auditory pathway in order to gain a holistic understanding of MOCR-mediated changes during a speech comprehension task. The strength of our methodology arises from extensively and continuously monitoring both the attentional states and the quality of our physiological measurements.

      Reviewer #3: [...] Reviewer 3-Comment 1: 1) However, I have several substantial concerns with the design, conceptualization, data analysis and interpretation of the results. I have had challenges to understand the hypotheses and rationale behind this study. A number of experimental paradigms have been employed, including peripheral/brainstem physiological measure, as well as cortical auditory responses during active versus 'passive' listening. Different noise conditions were tested but it is not clear to me what rationale was behind these stimulus choices. The authors claim that "our data comparing active and passive listening conditions highlight a categorical distinction between speech manipulation, a difference between processing a single, but degraded, auditory stream (vocoded speech) and parsing a complex acoustic scene to hear out a stream from multiple competing and spectrally similarly sounds" (lines 401-403). This seems like too much of a mouthful. I cannot see that the data support this pretty broad interpretation.

      Response to Reviewer 3-Comment 1:

      The main objective of this study is to examine the role of the auditory efferent system in active vs. passive listening tasks for three commonly employed speech manipulations. To address this, speech intelligibility was degraded in three ways: 1) noise vocoding the speech signal; 2) adding babble noise (BN) to the speech signal at different SNRs; or 3) adding speech-shaped noise (SSN) to the speech signal at different SNRs. The reason for using noise-vocoded speech while contralaterally recording CEOAEs is that it allowed speech intelligibility to be manipulated without increasing noise levels (a classical way of evoking the MOCR (Berlin et al., 1993; Norman & Thornton 1993; Kalaiah et al., 2017b)). This avoided confounding CEOAE magnitude changes due to purely stimulus-driven MOCR activation with attention-driven MOCR effects on CEOAE magnitudes. Moreover, because the level of the speech spectrum decreases with increasing frequency, white noise (which is the most commonly used stimulus to evoke MOCR in the literature) predominantly masks only the high-frequency component of the speech signal and is therefore not considered an efficient speech masker. In contrast, BN (besides representing a more ethological type of auditory noise) and SSN (which is spectrally matched to the long-term average of the speech signal) have the same long-term average spectrum as speech. Therefore, these noises were able to mask the speech signal equally across frequencies.
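
For clarity, adding a masker "at a given SNR" means scaling the noise so that the ratio of mean speech power to mean noise power, expressed in dB, equals the target value. The sketch below shows this arithmetic. It is illustrative only: the function names are hypothetical, signals are plain lists of samples, and the level and long-term spectrum normalization applied in the study are not reproduced.

```python
from math import sqrt


def mean_power(signal):
    """Mean squared amplitude of a sampled signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add it.

    Returns (mixture, scaled_noise). Both inputs are assumed to have the same
    length and sampling rate, and the noise is assumed to be non-silent.
    """
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = sqrt(target_noise_power / mean_power(noise))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise
```

For example, `mix_at_snr(word, babble, 5)` would yield the word in babble noise at +5 dB SNR, where `word` and `babble` stand for equal-length sample lists.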

      Reviewer 3-Comment 2: 2) Despite maintaining iso-difficulty between vocoded vs speech-in-noise (SIN) conditions, the authors neither address (a) the fundamental differences in understanding vocoded vs. SIN speech nor (b) any theoretical basis for how the noise manifests in vocoded speech. If the tasks are indeed so obviously 'categorically' different - then it should not be surprising they engage different processing (the 'denoising' may not be comparable). I would prefer much more clearly defined and targeted hypotheses and a justification of the specific stimulus and paradigm choices to test such hypotheses. It appears to me that numerous measures have been obtained (reflecting in fact very different processes along the auditory pathway) and then it has been attempted to make up some coherent conclusions from these data - but the assumptions are not clear, the data are very complex and many aspects of the discussion are speculative. To me, the most interesting element is the reversal of the MOCR behavior in the attended vs ignored conditions. However, ignoring a stimulus is not a passive task! It would have been interesting to also see cortical unattended results.

      Response to Reviewer 3-Comment 2:

      The motivation behind this study arises from controversy in the literature regarding attentional effects at the level of both the cochlea (via the MOCR) and the brainstem. The design of previous studies of attentional effects on CEOAEs has not only prevented direct comparison among them but has also distorted the interpretation of their results. Most have implemented paradigms with large differences in arousal state [or alertness levels (Eysenck, 2012)] and stimulus type between the active auditory task (e.g. speech stimuli presented while CEOAEs are recorded) and passive listening conditions (no task, CEOAEs recorded during no-noise conditions or with-noise conditions) (Froehlich et al., 1990; Meric et al., 1994; Srinivasan et al., 2012). Our experimental paradigm addressed these issues in three main ways: 1) using the same stimuli for both active and passive listening conditions; 2) using a controlled visual scene across the experimental sessions; and 3) attempting to control for differences in alertness during the passive condition by asking subjects to watch an engaging cartoon movie. The homogeneity of visual and auditory scenes across the experiments allowed the effects of attending to the speech on CEOAE magnitude to be disentangled from the stimulus-driven effects.

      In addition, it was never assumed that the “Passive listening” or “auditory-ignored” condition was a passive task. In this condition subjects were asked to ignore the auditory stimuli and to watch a non-subtitled, stop-motion movie. To ensure participants’ attention during this condition, they were monitored with a video camera and were asked questions at the end of the session (e.g. What happened in the movie? How many characters were present?) (see Methods section). The aim of a passive or auditory-ignoring condition is to shift attentional resources away from the auditory scene and towards the visual scene. As shown in Figure supplement 4, all ERP components were also obtained in the Passive listening condition and were of a smaller magnitude than the ERP components observed in the active listening conditions, demonstrating that cortical representation of the speech onset was enhanced in all active listening conditions.

      Reviewer 3-Comment 3: 2) Overall, I'm struggling with this study that touches upon various concepts and paradigms (efferent feedback, active vs. passive listening, neural representation of listening effort, modeling of efferent signal processing, stream segregation, speech-in-noise coding, peripheral vs cortical representations...) where each of them in isolation already provides a number of challenges and has been discussed controversially. In my view, it would be more valuable to specify and clarify the research question and focus on those paradigms that can help verify or falsify the research hypotheses.

      Response to Reviewer 3-Comment 3:

      In our study, we sought to explore how active listening to degraded speech modulates CEOAE magnitudes (as a proxy for efferent-MOCR effects). Our specific research question was: Does auditory attention modulate cochlear gain, via the auditory efferent system, in a task-dependent manner? Our hypothesis was: Decreases in speech intelligibility raise auditory attention and this reduces cochlear gain (measured using CEOAEs).

      In particular, unlike previously published studies, we assessed auditory changes objectively and subjectively as part of a highly controlled experimental paradigm, maintaining a constant performance across three experimental manipulations of speech intelligibility as well as minimizing influences of MEMR activation and controlling for homogeneity of both visual and auditory scenes across conditions. We agree with the reviewer that due to the complexity of our study, each section should be more explicit in its hypothesis and aims. This will be clarified in a future version of this manuscript.

  6. Aug 2020
    1. Mateus, J., Grifoni, A., Tarke, A., Sidney, J., Ramirez, S. I., Dan, J. M., Burger, Z. C., Rawlings, S. A., Smith, D. M., Phillips, E., Mallal, S., Lammers, M., Rubiro, P., Quiambao, L., Sutherland, A., Yu, E. D., Antunes, R. da S., Greenbaum, J., Frazier, A., … Weiskopf, D. (2020). Selective and cross-reactive SARS-CoV-2 T cell epitopes in unexposed humans. Science. https://doi.org/10.1126/science.abd3871

    1. Felipe, L. S., Vercruysse, T., Sharma, S., Ma, J., Lemmens, V., Looveren, D. van, Javarappa, M. P. A., Boudewijns, R., Malengier-Devlies, B., Kaptein, S. F., Liesenborghs, L., Keyzer, C. D., Bervoets, L., Rasulova, M., Seldeslachts, L., Jansen, S., Yakass, M. B., Quaye, O., Li, L.-H., … Dallmeier, K. (2020). A single-dose live-attenuated YF17D-vectored SARS-CoV2 vaccine candidate. BioRxiv, 2020.07.08.193045. https://doi.org/10.1101/2020.07.08.193045

    1. Ferretti, A. P., Kula, T., Wang, Y., Nguyen, D. M., Weinheimer, A., Dunlap, G. S., Xu, Q., Nabilsi, N., Perullo, C. R., Cristofaro, A. W., Whitton, H. J., Virbasius, A., Olivier, K. J., Baiamonte, L. B., Alistar, A. T., Whitman, E. D., Bertino, S. A., Chattopadhyay, S., & MacBeath, G. (2020). COVID-19 Patients Form Memory CD8+ T Cells that Recognize a Small Set of Shared Immunodominant Epitopes in SARS-CoV-2. MedRxiv, 2020.07.24.20161653. https://doi.org/10.1101/2020.07.24.20161653

    1. Zhu, F.-C., Guan, X.-H., Li, Y.-H., Huang, J.-Y., Jiang, T., Hou, L.-H., Li, J.-X., Yang, B.-F., Wang, L., Wang, W.-J., Wu, S.-P., Wang, Z., Wu, X.-H., Xu, J.-J., Zhang, Z., Jia, S.-Y., Wang, B.-S., Hu, Y., Liu, J.-J., … Chen, W. (2020). Immunogenicity and safety of a recombinant adenovirus type-5-vectored COVID-19 vaccine in healthy adults aged 18 years or older: A randomised, double-blind, placebo-controlled, phase 2 trial. The Lancet, 0(0). https://doi.org/10.1016/S0140-6736(20)31605-6

    1. Yonker, L. M., Neilan, A. M., Bartsch, Y., Patel, A. B., Regan, J., Arya, P., Gootkind, E., Park, G., Hardcastle, M., John, A. S., Appleman, L., Chiu, M. L., Fialkowski, A., Flor, D. D. la, Lima, R., Bordt, E. A., Yockey, L. J., D’Avino, P., Fischinger, S., … Fasano, A. (2020). Pediatric SARS-CoV-2: Clinical Presentation, Infectivity, and Immune Responses. The Journal of Pediatrics, 0(0). https://doi.org/10.1016/j.jpeds.2020.08.037

    1. Bangaru, S., Ozorowski, G., Turner, H. L., Antanasijevic, A., Huang, D., Wang, X., Torres, J. L., Diedrich, J. K., Tian, J.-H., Portnoff, A. D., Patel, N., Massare, M. J., Yates, J. R., Nemazee, D., Paulson, J. C., Glenn, G., Smith, G., & Ward, A. B. (2020). Structural analysis of full-length SARS-CoV-2 spike protein from an advanced vaccine candidate. BioRxiv, 2020.08.06.234674. https://doi.org/10.1101/2020.08.06.234674

  7. Jul 2020