Reviewer #3 (Public review):
Summary:
This valuable study presents a tool that uses brain anatomy to predict the layout and size of early visual maps, and it is strengthened by testing across a large and diverse collection of scans. The work will be useful for researchers who want to estimate likely visual map layout from standard anatomical scans and to relate anatomical differences to differences in visual organization across groups. The evidence is solid for the general usefulness of the approach, but incomplete for broader claims about prediction accuracy and use across datasets, particularly for estimates of map size and for showing that the model improves on repeated functional measurements.
Strengths:
The paper addresses a useful and important problem: estimating early visual map organization from anatomical measurements alone. Tools that predict these types of functional data from anatomical measurements were introduced more than a decade ago by Benson and colleagues, and the present authors have significantly extended that work. That is a real strength of the manuscript, because there is genuine value in having a practical tool that can estimate likely visual organization from standard anatomical scans.
Another major strength is the rigorous cross-dataset benchmarking and the accumulation of multiple datasets. The authors assembled a large and diverse set of scans and assessed model performance across different scanners, field strengths, and visual stimuli, which gives the reader a much better sense of how broadly the approach may apply. The retrospective analysis of more than 11,000 scans is especially notable and creates an unusual opportunity to ask how anatomical variation may relate to population differences in visual organization.
I also think the paper does a good job of showing why such a tool could matter in practice. A complete tool could be used in several ways. First, it could help users identify the locations of activations measured in other experiments with respect to the typical V1-V3 maps. Second, maps measured from an individual subject or patient could be compared with the predictions from the tool to ask whether they differ meaningfully from a standard anatomy-based map. Third, the tool can be used, as the authors have done here, to examine differences in anatomy across populations and interpret these differences with respect to retinotopic maps. Of these uses, the first already seems well supported by the current presentation.
Weaknesses:
(1) Quantification of the Analysis
My main concern is that the analysis relies heavily on global summary measures such as correlation and Dice score. Those measures are useful, but the paper would be more informative if it also quantified boundary differences in millimeters, especially for comparisons such as the V1/V2 boundary in Figure 2. That kind of analysis would help readers understand how large the errors are in physically meaningful terms.
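To make this suggestion concrete, a boundary error in millimeters could be summarized as the symmetric mean nearest-neighbor distance between predicted and measured border vertices. The following is a minimal sketch with toy coordinates; the function name and inputs are illustrative, not taken from the paper's code:

```python
import numpy as np

def mean_boundary_distance_mm(coords_a, coords_b):
    """Symmetric mean nearest-neighbor distance (mm) between two sets of
    boundary-vertex coordinates, e.g. predicted vs. measured V1/V2 borders."""
    # pairwise Euclidean distances between the two boundary point sets
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    # average the two directed mean nearest-neighbor distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# toy example: two borders offset by 2 mm along x
border_pred = np.array([[0.0, float(y), 0.0] for y in range(10)])
border_meas = border_pred + np.array([2.0, 0.0, 0.0])
print(mean_boundary_distance_mm(border_pred, border_meas))  # 2.0
```

Reported per subject, such a distance would express the Figure 2 boundary errors in physically meaningful units.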
(2) Model fitting methods
I also think the discussion of prediction failures for pRF size should be more explicit. The mismatch is likely influenced by the fact that the training data and several evaluation datasets were fit with different pRF models and different analysis software. In particular, the network was trained on non-linear size estimates from the HCP data, while the comparison datasets were derived using other packages and, in some cases, different model assumptions. That likely contributes to the spread in Figure 3b and deserves direct discussion. For reference, the pRF parameters were derived with the following software tools:
- HCP dataset (training data): analyzePRF (Compressive Spatial Summation model)
- NYU dataset: vistasoft
- Stanford dataset: vistasoft
- New Zealand dataset: SamSrf
- CHN dataset: Custom MATLAB software
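One concrete source of such disagreement is the size convention itself. Under the compressive spatial summation model used for the training data, the effective pRF size is conventionally reported as the Gaussian sigma divided by the square root of the compressive exponent, so raw sigmas from linear-model packages are not directly comparable. A brief sketch, assuming that standard CSS convention:

```python
import numpy as np

def css_effective_size(sigma, n):
    """Effective pRF size under the compressive spatial summation (CSS)
    model: sigma / sqrt(n). With n < 1, the effective size exceeds the
    raw Gaussian sigma, so comparing raw sigmas across packages that do
    or do not apply this correction inflates apparent disagreement."""
    return sigma / np.sqrt(n)

# raw sigma of 1 deg with a typical compressive exponent n = 0.25
print(css_effective_size(1.0, 0.25))  # 2.0
```

Stating which convention each dataset used, and harmonizing where possible, would help readers interpret the size comparisons.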
(3) Clarifying Model Accuracy
If deepRetinotopy generates a true "noise-removed" representation of functional mapping based on anatomy, then fitting it to one fMRI measurement should predict a second, independent fMRI run better than the noisy data from the first run does.
The authors possess the exact data for this test. For the HCP dataset, the empirical fMRI data were explicitly separated into two halves: "fit 2" (the first half of the fMRI runs) and "fit 3" (the second half). They correlated these two halves to establish a "noise ceiling," the maximum possible reliability of the data. Looking at their results in Figure 2b, the correlation of the deepRetinotopy predictions falls below this noise ceiling. This means that the noisy functional Half 1 actually predicts functional Half 2 better than the anatomical model does.
The authors should state this explicitly. A side-by-side plot of Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2 would show that the anatomical model regularizes map location well, but misses reliable subject-specific variation that anatomy alone cannot capture.
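The logic of this test can be sketched on simulated data (all quantities below are synthetic stand-ins, not the authors' data): a split-half correlation sets the ceiling, and an anatomy-based prediction that captures only the group-level component of the map falls below it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 3200  # roughly the size of the predicted posterior ROI

# simulated "true" subject map plus independent measurement noise per half
true_map = rng.normal(size=n_vertices)
half1 = true_map + 0.5 * rng.normal(size=n_vertices)
half2 = true_map + 0.5 * rng.normal(size=n_vertices)
# hypothetical anatomy-based prediction: partly tracks the true map but
# misses reliable subject-specific structure that anatomy cannot capture
model = 0.7 * true_map + 0.7 * rng.normal(size=n_vertices)

r_half = np.corrcoef(half1, half2)[0, 1]   # split-half "noise ceiling"
r_model = np.corrcoef(model, half2)[0, 1]  # model vs. held-out half
print(r_half > r_model)  # True: the noisy first half out-predicts the model
```

The suggested side-by-side plot is exactly this comparison, computed per subject on the real "fit 2" and "fit 3" halves.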
(4) The Hemodynamic Response Function
The assumptions used to generate the original empirical maps are permanently baked into the deep learning model. However, the authors explicitly mention the hemodynamic response function (HRF) only once, noting in the Methods that the modeled time series was "convolved with a canonical hemodynamic response function."
Beyond this single mention, there is no direct discussion of how the assumption of a single canonical HRF across all 161 HCP training subjects might have systematically impacted or biased the network's predictions. The authors address cross-dataset differences broadly under the umbrella of "experimental design" and "fMRI preprocessing pipeline" biases, but the HRF is a core biological property that mediates the connection between the anatomy and the data. The authors should explicitly discuss how this canonical assumption limits or biases the resulting deepRetinotopy network.
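The concern can be illustrated with a toy simulation (parameter values are illustrative, based on the common double-gamma parameterization, not on the authors' pipeline): if a subject's true HRF peaks earlier than the canonical one, the canonical-HRF prediction systematically mismatches that subject's time series, and that mismatch propagates into the fitted pRF parameters the network was trained on.

```python
import numpy as np
from math import gamma as gamma_fn

def double_gamma_hrf(t, peak=6.0, undershoot=16.0, ratio=6.0):
    """Illustrative double-gamma HRF (canonical-style parameters)."""
    g = lambda t, a: t ** (a - 1) * np.exp(-t) / gamma_fn(a)
    return g(t, peak) - g(t, undershoot) / ratio

t = np.arange(0.0, 32.0, 1.0)
neural = np.zeros(64)
neural[5] = 1.0  # brief neural event

canonical = np.convolve(neural, double_gamma_hrf(t))[:64]
# a subject whose HRF peaks earlier than the canonical assumption
subject = np.convolve(neural, double_gamma_hrf(t, peak=4.0))[:64]

r = np.corrcoef(canonical, subject)[0, 1]
print(r < 1.0)  # True: the canonical fit cannot fully match this subject
```

Even a short paragraph acknowledging this source of systematic, subject-level bias in the training labels would strengthen the Discussion.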
(5) Scoping the Input Data and Normative Use
The authors use FreeSurfer to generate a mean curvature map for the entire midthickness cortical surface. This full-hemisphere curvature map is resampled to a standard template surface space (32k_fs_LR), which serves as the input feature space for the neural network. However, while the network receives the full geometric structure of the hemisphere, it is explicitly trained to predict retinotopic parameters only within a restricted posterior ROI, based on the Wang et al. atlas and containing roughly 3,200 vertices per hemisphere.
A useful experiment to try, and perhaps the authors have already considered this, would be to restrict the input features exclusively to the posterior vertices. Including all anterior vertices may make it harder for the network to fit the localized visual data. A brief commentary on why the full hemisphere was retained as input could be highly informative for researchers adapting this geometric deep learning pipeline.
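The proposed control is simple to set up; one version (a sketch with hypothetical array shapes and mask, not the authors' code) replaces anterior curvature values with a constant so that only posterior geometry carries information:

```python
import numpy as np

rng = np.random.default_rng(1)
n_vertices = 32492  # vertices per hemisphere on the 32k_fs_LR surface

curvature = rng.normal(size=n_vertices)
# hypothetical boolean mask covering the ~3,200 posterior ROI vertices
posterior = np.zeros(n_vertices, dtype=bool)
posterior[:3200] = True

# variant input: anterior vertices set to the posterior mean, so only
# the posterior geometry remains informative to the network
masked = np.where(posterior, curvature, curvature[posterior].mean())
print(posterior.sum(), (~posterior).sum())  # 3200 29292
```

Comparing prediction accuracy with full-hemisphere versus posterior-only inputs would show directly whether the anterior geometry helps, hurts, or is ignored.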
