Reviewer #1 (Public review):
Porte et al. investigate how observers form confidence judgments about the presence vs absence of near-threshold audiovisual stimuli. In two psychophysical detection experiments, human participants judged whether a stimulus (visual, auditory, or audiovisual) was present or absent, reported amodal confidence, and then gave modality-specific detection and confidence ratings using a bidimensional scale. The authors report that audiovisual (AV) stimuli are detected more accurately than unimodal stimuli, but that multisensory stimulation does not improve metacognitive efficiency. Participants are more confident in absence than in presence judgments. They extend a previously proposed model to an audiovisual setting, assuming evidence is available only for presence and that absence is inferred via counterfactual detectability. Detection is modeled with a disjunctive integration rule across modalities, while confidence is explained by a combination of conjunctive (for presence) and disjunctive/negation-of-disjunction (for absence) rules.
There are several points I wish to have clarified, outlined below:
(1) Framing of bimodal vs unimodal detection
On p.3, the introduction states that "Adults typically show higher detection rates and faster reaction times for bimodal than for unimodal stimuli." This is broadly consistent with the literature, but as written it obscures the fact that these effects depend critically on the experimenter-chosen stimulus strengths. It is trivial to construct cases where a strong unimodal stimulus is more detectable than a bimodal stimulus composed of two very weak unimodal components. If "bimodal" is understood as the co-presentation of two unimodal components matched in detectability, then Bayes-rule-based arguments do predict better detection for the bimodal case; how much better is theoretically interesting, but it is not quantified in this paper (see the illustration below). There is an entire literature on the combination of two unimodal signals that the paper does not engage with; for a pertinent reference, see Ernst & Banks (2002). I recommend clarifying that the statement assumes comparable unimodal intensities.
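For concreteness, two standard benchmarks quantify the expected bimodal benefit under the assumptions of matched and statistically independent unimodal channels (the numbers below are purely illustrative, not taken from the paper):

```python
# Illustrative only: expected bimodal benefit under two standard assumptions
# about matched, independent unimodal channels (all values hypothetical).
p_A = p_V = 0.60                           # matched unimodal detection probabilities
p_AV = 1 - (1 - p_A) * (1 - p_V)           # probability summation -> 0.84

d_A = d_V = 1.0                            # matched unimodal sensitivities
d_AV = (d_A ** 2 + d_V ** 2) ** 0.5        # optimal SDT combination -> ~1.41
```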
(2) Relationship to signal detection theory and counterfactual perceptibility
In the introduction, the authors write, "If sensory evidence is only available for presence," motivating counterfactual perceptibility as a necessary ingredient to infer absence. However, standard signal detection theory (SDT) already provides a widely accepted framework in which a continuous internal response is present on both signal and noise (absent) trials, with absence corresponding to the noise distribution and decisions implemented by a criterion.
Thus, there is no logical need to invoke counterfactual perceptibility simply to define absence; rather, the Mazor-style framework adds an explicit belief model about detectability and an optimal stopping policy. It would strengthen the paper to more clearly state how the proposed model goes beyond SDT conceptually, acknowledge that SDT can account for presence/absence decisions without counterfactuals, and position the counterfactual account as a hypothesis about how observers actually compute absence/confidence, not as a necessity.
One of the central claims of the paper is that detection in the case of absence requires counterfactual reasoning. The authors should demonstrate whether or not an SDT-based generative model can describe the amodal and modality-specific decisions for unimodal and bimodal stimuli. In such a model, the noise distribution would be shared across conditions, and unimodal vs bimodal differences would be captured by changes in the mean or variance of the signal-plus-noise distribution (a sketch is given below).
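A minimal sketch of the kind of shared-noise SDT baseline I have in mind (all parameter values hypothetical, and not the authors' model):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sdt(d_prime, sigma, criterion, n_signal=10000, n_noise=10000):
    """Scalar-evidence SDT observer. Noise (stimulus-absent) trials share one
    N(0, 1) distribution across conditions; conditions differ only in the mean
    and/or variance of the signal-plus-noise distribution (values hypothetical)."""
    noise = rng.normal(0.0, 1.0, n_noise)            # shared absent distribution
    signal = rng.normal(d_prime, sigma, n_signal)    # condition-specific presence
    return np.mean(signal > criterion), np.mean(noise > criterion)

# Unimodal vs bimodal conditions differ only in d' (and possibly sigma);
# a single criterion is applied to the same scalar evidence in all conditions.
conditions = {"A": (0.8, 1.0), "V": (1.2, 1.0), "AV": (1.6, 1.0)}
criterion = 1.0

for name, (dp, sd) in conditions.items():
    hit, fa = simulate_sdt(dp, sd, criterion)
    print(f"{name}: hit rate = {hit:.2f}, false-alarm rate = {fa:.2f}")
```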
(3) Confidence vs performance: is AV confidence special?
The paper's central claims about multisensory confidence and metacognition would be stronger if the authors showed that AV confidence deviates from what is expected given performance alone. From the reported results, AV accuracy is around 80%, with visual and auditory at about 60% and 40%, respectively. Given that confidence typically scales monotonically with accuracy, the first question is whether AV confidence is entirely explained by improved performance or whether there is an additional multisensory contribution. A simple, informative analysis would be to plot, for each subject, mean confidence against percent correct for the AV, V, A, and absent conditions, and to test whether AV confidence lies above the trend predicted by accuracy alone (a sketch of such an analysis follows).
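One possible implementation (the data frame and column names are hypothetical): fit the confidence-accuracy relationship per subject from the non-AV conditions, then test whether the AV condition lies above that line.

```python
import numpy as np

# Hypothetical per-subject summary: one row per condition (A, V, AV, absent)
# with columns 'condition', 'pct_correct', and 'mean_confidence'.
def av_confidence_residual(df_subject):
    """Fit confidence ~ accuracy on the non-AV conditions; return how far the
    AV condition sits above or below that line."""
    train = df_subject[df_subject["condition"] != "AV"]
    slope, intercept = np.polyfit(train["pct_correct"], train["mean_confidence"], 1)
    av = df_subject[df_subject["condition"] == "AV"].iloc[0]
    predicted = slope * av["pct_correct"] + intercept
    return av["mean_confidence"] - predicted   # > 0: AV confidence above the trend

# residuals = data.groupby("subject").apply(av_confidence_residual)
# A one-sample test of these residuals against zero (e.g., Wilcoxon) then asks
# whether AV confidence exceeds what accuracy alone predicts.
```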
(4) Metacognitive measures: logistic regression slopes vs meta-d′/d′
In the "Multisensory effects on metacognitive performance" section, the authors define "metacognitive sensitivity" as the slope of a Bayesian logistic regression predicting accuracy from confidence. There is substantial literature showing that logistic-slope measures of metacognitive sensitivity are criterion-dependent and can be affected by both task and confidence criteria (for one example, see Rausch & Zehetleitner, 2017). In contrast, meta-d′/d′ was specifically developed to provide a bias-invariant measure of metacognitive efficiency. Though this, too, is dated (see Boundy-Singer et al., 2023). Given that the authors already estimate HMeta-d-based M-ratios, it is unclear why they rely on logistic regression slopes as their primary "metacognitive sensitivity" metric in Figure 4A. I suggest either replacing the logistic-slope metric with SDT-based measures (meta-d′, meta-d′/d′) or providing a clear justification for using logistic slopes, along with a discussion of their known limitations.
Additionally, Figure 3 reports M-ratios without showing the corresponding d′ or meta-d′ for judge-present vs judge-absent conditions. Presenting these would help contextualize the metacognitive efficiency results and clarify whether differences are driven mainly by changes in metacognitive sensitivity, changes in task performance, or both. The d′ values per condition could be added to Figure 2A.
(5) Interpretation of confidence in absence vs presence
The authors emphasize that it is surprising that subjects are more confident in absence than in presence judgments, both at the amodal and the modality-specific level. However, Figure 2B suggests that absent responses are very accurate: a stimulus is reported as present on only about 10% of absent trials, implying a high correct rejection rate. If confidence tracks outcome probability, higher confidence for absence may be at least partly expected. Before attributing this asymmetry primarily to counterfactual reasoning, it would be important to explicitly relate confidence to accuracy for hits, misses, false alarms, and correct rejections, and to show whether absence confidence remains elevated relative to presence after controlling for accuracy differences across judgment types and conditions (a sketch of such a breakdown follows). Without this, the interpretation that higher absence confidence is inherently "unexpected" seems overstated.
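A minimal version of this breakdown (the trial-level data frame and column names are hypothetical):

```python
# Hypothetical trial-level data frame 'trials' with columns
# 'condition' (A, V, AV, absent), 'stimulus' (0/1), 'response' (0/1),
# and 'confidence'.
def sdt_outcome(row):
    """Label each trial with its signal detection outcome."""
    if row["stimulus"] and row["response"]:
        return "hit"
    if row["stimulus"] and not row["response"]:
        return "miss"
    if not row["stimulus"] and row["response"]:
        return "false alarm"
    return "correct rejection"

# trials["outcome"] = trials.apply(sdt_outcome, axis=1)
# trials.groupby(["condition", "outcome"])["confidence"].mean()
# Comparing confidence for correct rejections against hits at matched accuracy
# would show whether the absence > presence asymmetry survives the control.
```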
(6) Model: integration rules, confidence, and evidence strength
The modeling section extends the Mazor et al. ideal observer to two modality-specific sensors, with disjunctive integration for detection and then disjunctive vs conjunctive integration rules for confidence. I have a few comments.
First, the detection rule is disjunctive and is reported as a finding. However, the conclusion that detection relies on a disjunctive rule ("present if A or V") closely mirrors the task instructions: participants are explicitly told to respond "present" if they detect the stimulus in any modality. As such, this seems more like a sanity check than a novel empirical finding.
Relatedly, the conjunctive detection rule is a weak null comparison. The conjunctive rule ("present only if both A and V") is behaviorally implausible given the task instructions. A more informative baseline would be an SDT-style scalar-evidence model (see comment 2), rather than a conjunctive rule that participants would have to actively violate the instructions to follow (the sketch below contrasts the three candidate rules).
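To make the comparison concrete, a sketch of the three candidate detection rules applied to modality-specific evidence samples (thresholds and signal strengths are hypothetical, not fitted values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def detect(x_a, x_v, rule, theta=1.0):
    """Three candidate detection rules on modality-specific evidence."""
    if rule == "disjunctive":     # "present" if either channel exceeds theta
        return (x_a > theta) | (x_v > theta)
    if rule == "conjunctive":     # "present" only if both channels exceed theta
        return (x_a > theta) & (x_v > theta)
    if rule == "scalar":          # SDT-style: pool evidence, single criterion
        return (x_a + x_v) > 2 * theta
    raise ValueError(rule)

# Hypothetical AV-present trials: weak signal in both channels.
x_a = rng.normal(0.8, 1.0, 10000)
x_v = rng.normal(0.8, 1.0, 10000)
for rule in ("disjunctive", "conjunctive", "scalar"):
    print(rule, detect(x_a, x_v, rule).mean())   # predicted "present" rates
```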
Second, confidence in the model is defined as the probability of being correct at the time of the detection decision. However, this implies a fixed amount of evidence at decision time unless additional mechanisms are invoked. This issue is well known in diffusion modeling (see Kiani et al. 2014) and deserves explicit discussion; otherwise, it is unclear how the model produces graded confidence from a bound-crossing rule alone.
Third, the authors do not consider a straightforward evidence-strength account of confidence. When both modalities indicate presence, there is, on average, more total sensory evidence than in unimodal trials, making correct decisions more likely and, under most frameworks, confidence higher. Likewise, weak evidence in both modalities can constitute stronger evidence for absence than moderate evidence in one modality combined with weak evidence in the other. Many of the patterns that motivate the presence-conjunctive/absence-disjunctive mix could therefore arise from a model in which confidence simply reflects the amount of evidence for the chosen option, without positing distinct logical integration rules for presence vs absence. As the authors note, purely disjunctive or purely conjunctive confidence rules fail to capture the trends in confidence reports in Figure 7, leading them to adopt a combined presence-conjunctive / absence-disjunctive rule. A more parsimonious alternative, in which confidence scales with evidence magnitude and cross-modal agreement, should be explicitly considered and, ideally, implemented as a competing model (a minimal sketch follows).
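A minimal sketch of the kind of competing model I mean (purely illustrative; the pooling scheme, criterion, and agreement term are my assumptions, not the authors' model):

```python
import numpy as np

def evidence_strength_model(x_a, x_v, criterion=1.0):
    """Confidence as the distance of pooled evidence from a single criterion,
    with no separate logical rule for presence vs absence judgments."""
    pooled = x_a + x_v                            # cross-modal pooling
    choose_present = pooled > 2 * criterion       # one scalar decision rule
    confidence = np.abs(pooled - 2 * criterion)   # evidence for the chosen option
    agreement = -np.abs(x_a - x_v)                # one possible cross-modal agreement term
    return choose_present, confidence, agreement
```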
Finally, if the model is intended as a good account of the data, it would be useful to report whether it also reproduces the metacognitive efficiency patterns (M-ratios) beyond the mean confidence patterns shown in Figures 7-8. At present, the model appears systematically over-confident, which should be acknowledged and quantified.
(7) Confidence asymmetry index (CAI) and modality weighting
The confidence asymmetry index (CAI) is defined as the difference between auditory and visual confidence on AV vs absent trials, and the authors report strong correlations between observed and simulated CAI across participants. They interpret this as evidence that subjects place different weights on auditory vs visual signals. Several questions arise. First, does CAI capture asymmetries beyond what is expected from accuracy differences between modalities and conditions? Second, because the simulated data are generated from model fits to the observed data, a correlation between observed and simulated CAI is expected: the model is built to reproduce the individual patterns it is then compared to. A stronger test would compare CAI from data simulated with modality-specific belief parameters versus CAI from data simulated with constrained, equal belief parameters (same θs); a toy sketch of this comparison is given below. Relatedly, the paper would benefit from a plot showing the distribution of θs for A-present and V-present stimuli across subjects. These values could also be related to unimodal sensitivity measured in the calibration/training phases. A natural prediction is that higher unimodal sensitivity should correspond to higher belief parameters for presence.
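A toy sketch of the comparison (restricted to AV-present trials for brevity; the signal strengths, confidence mapping, and parameter names are my assumptions, not the authors' model):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_cai(theta_a, theta_v, n=2000):
    """Modality-specific confidence as a function of how far each channel's
    evidence lies beyond its belief/threshold parameter (all values hypothetical)."""
    x_a = rng.normal(1.0, 1.0, n)
    x_v = rng.normal(1.0, 1.0, n)
    conf_a = 1.0 / (1.0 + np.exp(-(x_a - theta_a)))
    conf_v = 1.0 / (1.0 + np.exp(-(x_v - theta_v)))
    return np.mean(conf_a - conf_v)               # auditory-minus-visual asymmetry

cai_specific = simulate_cai(theta_a=0.5, theta_v=1.5)   # modality-specific thetas
cai_equal = simulate_cai(theta_a=1.0, theta_v=1.0)      # constrained equal thetas
print(cai_specific, cai_equal)
# Correlating observed CAI with CAI simulated under each variant, across
# participants, would test whether modality-specific belief parameters are needed.
```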