Reviewer #2 (Public review):
Summary:
This very ambitious project addresses a core question in visual processing: its underlying anatomical and functional architecture. Using a large sample of rare and high-quality intracranial EEG recordings in humans, the authors assess whether face selectivity is organised along a posterior-anterior gradient, with selectivity and timing increasing from posterior to anterior regions. The evidence suggests that this is the case for selectivity, but the data are more mixed about the temporal organisation, which the authors use to argue that the classic temporal hierarchy described in textbooks might be questioned, at least when it comes to face processing.
Strengths:
A huge amount of work went into collecting this highly valuable dataset of rare intracranial EEG recordings in humans. The data alone are valuable, assuming they are shared in an easily accessible and well-documented format; currently, the OSF repository linked in the article is empty, so no assessment of the data can be made. The topic is important, and a key question in the field is addressed. The EEG methodology is strong, relying on a well-established, high-SNR SSVEP method, which is particularly well suited to clinical populations and yields interpretable data within a few minutes of recording. The authors have attempted to quantify the data in many different ways, providing various estimates of selectivity and timing with matching measures of uncertainty, including non-parametric confidence intervals and comparisons. Collectively, the various analyses and rich illustrations provide superficially convincing evidence in favour of the conclusions.
Weaknesses:
(1) The work was not pre-registered, and there is no sample-size justification, whether for participants or for trials/sequences. A statistical reviewer should therefore assess the sensitivity of the analyses to different approaches.
(2) Frequentist NHST is used to claim a lack of effects, which is inappropriate; see, for instance, the two references below (a minimal Bayes-factor alternative is sketched after them):
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337-350. https://doi.org/10.1007/s10654-016-0149-3
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520-547. https://doi.org/10.1111/tops.12214
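To make this concrete, here is a minimal sketch, assuming Python with the pingouin package and simulated data (not the authors' data or variable names), of how evidence about the null could be quantified with a default Bayes factor rather than a non-significant p-value:

```python
# Minimal sketch: default Bayes factor t-test via pingouin, to quantify
# evidence for or against a difference, instead of reading a non-significant
# p-value as "no effect". Onset values below are simulated for illustration.
import numpy as np
import pingouin as pg

rng = np.random.default_rng(1)
onsets_posterior = rng.normal(110, 15, size=12)  # simulated onsets (ms)
onsets_anterior = rng.normal(112, 15, size=12)   # simulated onsets (ms)

res = pg.ttest(onsets_posterior, onsets_anterior)
print(res[["T", "p-val", "BF10"]])
# BF10 well below 1 (e.g., < 1/3) would indicate evidence favouring the null,
# which a p-value alone cannot provide.
```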
(3) In the frequentist realm, demonstrating similar effects between groups requires equivalence testing, with bounds (minimum effect sizes of interest) that should be pre-registered (a sketch follows the references below):
Campbell, H., & Gustafson, P. (2024). The Bayes factor, HDI-ROPE, and frequentist equivalence tests can all be reverse engineered-Almost exactly-From one another: Reply to Linde et al. (2021). Psychological Methods, 29(3), 613-623. https://doi.org/10.1037/met0000507
Riesthuis, P. (2024). Simulation-Based Power Analyses for the Smallest Effect Size of Interest: A Confidence-Interval Approach for Minimum-Effect and Equivalence Testing. Advances in Methods and Practices in Psychological Science, 7(2), 25152459241240722. https://doi.org/10.1177/25152459241240722
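A minimal sketch of such an equivalence test (TOST), again assuming Python with pingouin and purely illustrative bounds; in a real analysis, the bounds would have to be justified and ideally pre-registered before seeing the data:

```python
# Minimal sketch: two one-sided tests (TOST) for equivalence of mean onsets,
# with symmetric, purely illustrative bounds of +/- 20 ms. Data simulated.
import numpy as np
import pingouin as pg

rng = np.random.default_rng(2)
onsets_a = rng.normal(110, 15, size=12)  # simulated onsets (ms)
onsets_b = rng.normal(112, 15, size=12)

res = pg.tost(onsets_a, onsets_b, bound=20)  # equivalence bounds: [-20, +20] ms
print(res)  # a small p-value supports the difference lying within the bounds
```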
(4) The lack of consideration for sample sizes, the lack of pre-registration, and the lack of a method to support the null (a cornerstone of this project, which aims to demonstrate equivalent onsets between areas) suggest that the work is exploratory. This is a strength: we need rich datasets to explore, to test tools, and to generate new hypotheses. I strongly recommend embracing the exploratory philosophy and removing all inferential statistics: instead, provide even more detailed graphical representations (including onset distributions) and share the data immediately with all the pre-processing and analysis code.
(5) Even if the work had been pre-registered, it would be very difficult to calculate p-values conditional on all the uncertainty around the number of participants, the number of contacts, and the number of trials: these are random variables, and the sampling distributions of key inferences should be integrated over these unknown sources of variability. The difficulty of calculating and interpreting p-values that are conditional on so many pre-processing stages and sources of uncertainty is traditionally swept under the rug, but it is nevertheless well documented:
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573-603. https://pubmed.ncbi.nlm.nih.gov/22774788/
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804. https://doi.org/10.3758/BF03194105
(6) Currently, there is no convincing evidence in the article to clearly support the main claims.
Bootstrap confidence intervals were used to provide measures of uncertainty. However, the bootstrapping did not take the structure of the data into account, collapsing across important dependencies in that nested structure: participants > hemispheres > contacts > conditions > trials.
Ignoring data dependencies and the uncertainty from trials can distort the CIs. Sampling contacts with replacement is inappropriate because it breaks the structure of the data, mixing degrees of freedom across different levels of analysis. The key rule of the bootstrap is to follow the data-acquisition process, so sampling participants with replacement should come first. In a hierarchical bootstrap, the process is repeated at the nested levels: for each resampled participant, contacts are resampled (if treated as a random variable), and then trials/sequences are resampled, keeping paired measurements together (hemispheres, and typically contacts in a standard EEG experiment with a fixed montage). The same hierarchical resampling should be applied to all measurements and inferences to capture all sources of variability. Selectivity and timing should be quantified at each contact after resampling of trials/sequences, before integrating across hemispheres and participants using appropriate and justified summary measures. A minimal sketch follows.
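The sketch below assumes Python and a hypothetical nested data layout (participants > contacts > trials); hemispheres and conditions are omitted for brevity, but paired measurements would be kept together within each resampled participant:

```python
# Minimal sketch of a hierarchical bootstrap following the acquisition process:
# resample participants first, then contacts within each resampled participant
# (only if contacts are treated as a random factor), then trials within each
# contact. The data layout and all names are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

def hierarchical_boot(data, n_boot=2000, summary=np.median):
    """data: dict mapping participant -> dict mapping contact -> 1D trial array."""
    participants = list(data)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        p_sample = rng.choice(participants, size=len(participants))  # level 1
        p_vals = []
        for p in p_sample:
            contacts = list(data[p])
            c_sample = rng.choice(contacts, size=len(contacts))      # level 2
            c_vals = []
            for c in c_sample:
                trials = data[p][c]
                t_sample = rng.choice(trials, size=len(trials))      # level 3
                c_vals.append(t_sample.mean())  # per-contact estimate
            p_vals.append(np.mean(c_vals))      # per-participant estimate
        boot[b] = summary(p_vals)               # group-level summary
    return boot

# Toy nested data: 3 participants x 2 contacts x 10 trials
data = {p: {c: rng.normal(1.0, 0.5, size=10) for c in range(2)} for p in range(3)}
lo, hi = np.percentile(hierarchical_boot(data), [2.5, 97.5])
print(f"95% hierarchical bootstrap CI: [{lo:.2f}, {hi:.2f}]")
```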
The authors already recognise part of the problem, as they provide within-participant analyses. This is a very good step, inasmuch as it addresses the issue of mixing up degrees of freedom across levels, but unfortunately these analyses are plagued by small sample sizes, making claims about the lack of differences even more problematic: the classic fallacy of treating absence of evidence as evidence of absence. In addition, there seem to be discrepancies between the mean and CI in some cases: 15 [-20, 20]; 8 [-24, 24].
(7) Three other issues related to onsets:
(a) FDR correction typically does not support localisation claims, similarly to cluster-based inference:
Winkler, A. M., Taylor, P. A., Nichols, T. E., & Rorden, C. (2024). False Discovery Rate and Localizing Power (No. arXiv:2401.03554). arXiv. https://doi.org/10.48550/arXiv.2401.03554
Rousselet, G. A. (2025). Using cluster-based permutation tests to estimate MEG/EEG onsets: How bad is it? European Journal of Neuroscience, 61(1), e16618. https://doi.org/10.1111/ejn.16618
(b) Percentile bootstrap confidence intervals are inaccurate when applied to means. Instead, use a bootstrap-t method, or use the percentile bootstrap in conjunction with a robust measure of central tendency, such as a trimmed mean (a sketch follows the reference below):
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920911881. https://doi.org/10.1177/2515245920911881
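For instance, a minimal sketch of a percentile bootstrap CI for a 20% trimmed mean, assuming Python and simulated data:

```python
# Minimal sketch: percentile bootstrap CI for a 20% trimmed mean, which is
# better behaved than the same interval computed on the mean. Data simulated.
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(3)
x = rng.lognormal(sigma=1.0, size=30)  # skewed sample, where the mean misbehaves

boot_tm = np.array([
    trim_mean(rng.choice(x, size=x.size), proportiontocut=0.2)
    for _ in range(5000)
])
lo, hi = np.percentile(boot_tm, [2.5, 97.5])
print(f"20% trimmed mean = {trim_mean(x, 0.2):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```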
(c) Defining onsets based on an arbitrary "at least 30 ms" rule is not recommended (a change-point alternative is sketched after the reference below):
Piai, V., Dahlslätt, K., & Maris, E. (2015). Statistically comparing EEG/MEG waveforms through successive significant univariate tests: How bad can it be? Psychophysiology, 52(3), 440-443. https://doi.org/10.1111/psyp.12335
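One alternative family of approaches treats the onset as a change point in the time course. A minimal sketch, assuming Python, a single abrupt change, and a simple least-squares cost (real data would call for more careful modelling):

```python
# Minimal sketch: estimate an onset as a single least-squares change point,
# instead of the first sample of a run of >= 30 ms of significant tests.
# The time course is simulated, and a single change point is assumed.
import numpy as np

def single_changepoint(y):
    """Index that best splits y into two constant segments (L2 cost)."""
    n = len(y)
    costs = [np.var(y[:k]) * k + np.var(y[k:]) * (n - k) for k in range(1, n - 1)]
    return int(np.argmin(costs)) + 1

rng = np.random.default_rng(4)
t = np.arange(200)  # samples, e.g., one per ms
y = np.where(t < 120, 0.0, 1.0) + rng.normal(0, 0.3, size=t.size)
print(single_changepoint(y))  # should land near sample 120
```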
(8) Figure 5 and matching analyses: there are much better tools than correlations to estimate connectivity and directionality; see, for instance, the reference below (a minimal sketch of the core idea follows it):
Ince, R. A. A., Giordano, B. L., Kayser, C., Rousselet, G. A., Gross, J., & Schyns, P. G. (2017). A statistical framework for neuroimaging data analysis based on mutual information estimated via a Gaussian copula. Human Brain Mapping, 38(3), 1541-1573. https://doi.org/10.1002/hbm.23471
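A minimal sketch of the core of that framework (Gaussian-copula mutual information) for two continuous 1D signals, assuming Python and simulated data; the cited paper's accompanying toolbox offers much more, including conditional and directed variants:

```python
# Minimal sketch: Gaussian-copula mutual information (GCMI) between two
# continuous 1D signals, following Ince et al. (2017). Data simulated.
import numpy as np
from scipy.stats import norm, rankdata

def copnorm(x):
    """Rank-based transform to standard normal margins (the copula step)."""
    return norm.ppf(rankdata(x) / (len(x) + 1))

def gcmi_cc(x, y):
    """Lower-bound estimate of MI (in bits) between two continuous variables."""
    r = np.corrcoef(copnorm(x), copnorm(y))[0, 1]
    return -0.5 * np.log2(1.0 - r**2)

rng = np.random.default_rng(5)
a = rng.normal(size=500)
b = a + rng.normal(scale=1.0, size=500)  # correlated signal
print(f"GCMI(a, b) = {gcmi_cc(a, b):.3f} bits")
```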
(9) Pearson correlation is sensitive to features of the data other than the strength of an association, and it is maximally sensitive to linear associations. Interpretation is difficult without seeing the matching scatterplots and without confirmation from alternative robust methods, such as rank-based or skipped correlations; a minimal illustration follows.
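As an illustration, a minimal sketch (Python assumed, simulated data) of how a single outlier can distort Pearson's r while a rank-based alternative is less affected, and why the scatterplot should always be shown:

```python
# Minimal sketch: Pearson vs Spearman on simulated data with one outlier,
# plus the scatterplot needed to interpret either coefficient.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.5, size=50)
y[0] = 10.0  # one outlier is enough to distort Pearson's r

print("Pearson:", stats.pearsonr(x, y))    # sensitive to outliers/non-linearity
print("Spearman:", stats.spearmanr(x, y))  # rank-based, more robust

plt.scatter(x, y)
plt.xlabel("x"); plt.ylabel("y")
plt.title("Always inspect the scatterplot")
plt.show()
```

Skipped correlations (Wilcox) would be another robust option worth reporting alongside Pearson's r.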