Reviewer #3 (Public review):
This paper investigates invariance to natural background noise in the auditory cortex of ferrets and humans. The authors first replicate, in ferrets, a finding from human neuroimaging showing that invariance to background noise increases along the cortical hierarchy (i.e., from primary to non-primary auditory cortex). Next, the authors ask whether this pattern of invariance could be explained by differences in tuning to low-level acoustic features across primary and non-primary regions. The authors conclude that this tuning can explain the spatial organization of background invariance in ferrets, but not in humans. The conclusions of the paper are generally well supported by the data, but additional control analyses are needed to fully substantiate the paper's claims. Finally, additional discussion, and potentially additional analysis, is needed to reconcile these findings with similar work in the literature (particularly that of Hamersky et al. 2025 J. Neurosci.).
The paper is clearly written and presented, with well-designed and visually appealing figures. Not only does this paper provide an important replication in a non-human animal model commonly used in auditory neuroscience, but it also extends the original findings in three ways. First, the authors reveal a more fine-grained gradient of background invariance by showing that background invariance increases across primary, secondary, and tertiary cortical regions. Second, the authors address a potential mechanism that might underlie this pattern of invariance by considering whether differences in tuning to frequency and spectrotemporal modulations across regions could account for the observed pattern of invariance. The spectrotemporal modulation encoding model used here is a well-established approach in auditory neuroscience and seems appropriate for exploring potential mechanisms underlying invariance in auditory cortex, particularly in ferrets. However, as discussed below, the analyses based on this simple encoding model are only informative to the extent that the model accurately captures neural responses. Thus, its limitations in modeling non-primary human auditory cortex should be considered when interpreting cross-species comparisons. Third, the authors provide a more complete picture of invariance by additionally analyzing foreground invariance, a complementary measure not explored in the original study. While this analysis is a natural extension and its inclusion is appreciated, the interpretation of these foreground invariance findings remains somewhat unclear, as the authors offer limited discussion of their significance or relation to existing literature.
As mentioned above, interpretation of the invariance analyses using predictions from the spectrotemporal modulation encoding model hinges on the model's ability to accurately predict neural responses. Although Figure S5 suggests the encoding model was generally able to predict voxel responses accurately, the authors note in the introduction that, in human auditory cortex, this kind of tuning can explain responses in primary areas but not in non-primary areas (Norman-Haignere & McDermott, PLOS Biol. 2018). Indeed, the prediction accuracy histograms in Figure S5C suggest a slight difference in the model's ability to predict responses in primary versus non-primary voxels. Additional analyses should be done to a) determine whether the prediction accuracies are meaningfully different across regions and b) examine whether controlling for prediction accuracy across regions (i.e., sub-selecting voxels across regions with matched prediction accuracy) affects the outcomes of the invariance analyses.
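To make suggestion (b) concrete, the following is a minimal sketch of one way to sub-select voxels with matched prediction accuracy, assuming per-voxel accuracies are available as one array per region. The function name `match_by_accuracy`, the number of bins, and the histogram-matching strategy are illustrative assumptions on my part, not the authors' procedure.

```python
import numpy as np

def match_by_accuracy(acc_a, acc_b, n_bins=10, seed=0):
    """Sub-select voxels from two regions so that their prediction-accuracy
    histograms are identical. Returns index arrays into acc_a and acc_b.
    (Hypothetical helper; binning scheme is an assumption.)"""
    rng = np.random.default_rng(seed)
    acc_a = np.asarray(acc_a, dtype=float)
    acc_b = np.asarray(acc_b, dtype=float)

    # Shared bin edges spanning both regions.
    lo = min(acc_a.min(), acc_b.min())
    hi = max(acc_a.max(), acc_b.max())
    edges = np.linspace(lo, hi, n_bins + 1)

    # Using interior edges only keeps bin ids in 0..n_bins-1.
    bin_a = np.digitize(acc_a, edges[1:-1])
    bin_b = np.digitize(acc_b, edges[1:-1])

    keep_a, keep_b = [], []
    for b in range(n_bins):
        idx_a = np.flatnonzero(bin_a == b)
        idx_b = np.flatnonzero(bin_b == b)
        n = min(idx_a.size, idx_b.size)
        if n == 0:
            continue  # no overlap in this accuracy range
        # Randomly keep the same number of voxels from each region.
        keep_a.append(rng.choice(idx_a, size=n, replace=False))
        keep_b.append(rng.choice(idx_b, size=n, replace=False))

    if not keep_a:
        empty = np.array([], dtype=int)
        return empty, empty
    return np.sort(np.concatenate(keep_a)), np.sort(np.concatenate(keep_b))
```

The invariance analyses could then be repeated on the returned subsets (ideally over several random seeds) to check whether the regional differences persist once prediction accuracy is equated.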
A related concern is the procedure used to train the encoding model. From the methods, it appears that the model may have been fit using responses to both isolated and mixture sounds. If so, this raises questions about the interpretability of the invariance analyses. In particular, fitting the model to all stimuli, including mixtures, may inflate the apparent ability of the model to "explain" invariance, since it is effectively trained on the phenomenon it is later evaluated on. Put another way, if a voxel exhibits invariance, and the model is trained to predict the voxel's responses to all types of stimuli (both isolated sounds and mixtures), then the model must also show invariance to the extent it can accurately predict voxel responses, making the result somewhat circular. A more informative approach would be to train the encoding model only on responses to isolated sounds (or even better, a completely independent set of sounds), as this would help clarify whether any observed invariance is emergent from the model (i.e., truly a result of low-level tuning to spectrotemporal features) or simply reflects what it was trained to reproduce.
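The suggested control can be sketched as follows, under several assumptions of mine: the encoding model is approximated by a ridge regression from stimulus features to a single voxel's response, and model-predicted background invariance is taken to be the Pearson correlation between predicted responses to mixtures and predicted responses to the corresponding isolated foregrounds. All function names, the regularization value, and the feature matrices are hypothetical placeholders rather than the authors' implementation.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                        Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def predict(X, w, b):
    """Linear prediction from features."""
    return X @ w + b

def predicted_background_invariance(X_iso, y_iso, X_mix, X_fg, alpha=1.0):
    """Fit on isolated sounds only, then measure invariance of predictions.

    X_iso, y_iso : features and measured responses for isolated sounds
    X_mix        : features of the mixtures (never used for fitting)
    X_fg         : features of the matching isolated foregrounds
    """
    w, b = fit_ridge(X_iso, y_iso, alpha)
    pred_mix = predict(X_mix, w, b)
    pred_fg = predict(X_fg, w, b)
    # Assumed invariance metric: correlation across stimuli.
    return np.corrcoef(pred_mix, pred_fg)[0, 1]
```

Because the mixtures never enter the fit, any invariance in the predictions must arise from the tuning learned on isolated sounds rather than from the model reproducing the mixture responses it was trained on.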
Finally, the interpretation of the foreground invariance results remains somewhat unclear. In ferrets (Figure 2I), the authors report relatively little foreground invariance, whereas in humans (Figure 5G), most participants appear to show relatively high levels of foreground invariance in primary auditory cortex (around 0.6 or greater). However, the paper does not explicitly address these apparent cross-species differences. Moreover, the findings in ferrets seem at odds with other recent work in ferrets (Hamersky et al. 2025 J. Neurosci.), which shows that background sounds tend to dominate responses to mixtures, suggesting a prevalence of foreground invariance at the neuronal level. Although the methods of that study differ substantially from those used here, the contrast with the present findings warrants further discussion to contextualize the current results and clarify how they relate to prior work.
