Reviewer #2:
In this paper, Fiscella and colleagues report the results of behavioral experiments on auditory perception in healthy participants. The paper is clearly written, and the stimulus manipulations are well thought out and executed.
In the first experiment, audiovisual speech perception was examined in 15 participants. Participants identified keywords in English sentences while viewing faces that were either dynamic or still, and either upright or rotated. To make the task more difficult, two irrelevant masking streams (one audiobook with a male talker, one audiobook with a female talker) were added to the auditory speech at different signal-to-noise ratios for a total of three simultaneous speech streams.
The results of the first experiment were that both the visual face and the auditory voice influenced accuracy. Seeing the moving face of the talker resulted in higher accuracy than seeing a static face; an upright moving face was better than a 90-degree rotated face, which was in turn better than an inverted moving face. In the auditory domain, performance improved as the level of the masking streams decreased (i.e., at more favorable signal-to-noise ratios).
In the second experiment, 23 participants identified pitch modulations in auditory speech. The task was considerably more complicated than in the first experiment. First, participants had to learn an association between visual faces and auditory voices. Then, on each trial, they were presented with a static face that cued which auditory voice to attend to. Next, both target and distracter voices were presented, and participants searched for pitch modulations only in the target voice. At the same time, audiobook masking streams were presented, for a total of four simultaneous speech streams. In addition, participants performed a visual task, consisting of searching for a pink dot on the mouth of the visually presented face. The visual face matched either the target voice or the distracter voice, and the face was either upright or inverted.
The result of the second experiment was that participants were somewhat more accurate (by 7%) at identifying pitch modulations when the visual face matched the target voice than when it did not.
As I understand it, the main claim of the manuscript is as follows:
For sentence comprehension in Experiment 1, both face matching (measured as the contrast of dynamic face vs. static face) and face rotation were influential.
For pitch modulation in Experiment 2, only face matching (measured as the contrast of target-stream vs. distracter-stream face) was influential.
This claim is summarized in the abstract as "Although we replicated previous findings that temporal coherence induces binding, there was no evidence for a role of linguistic cues in binding. Our results suggest that temporal cues improve speech processing through binding and linguistic cues benefit listeners through late integration."
The claim for Experiment 2 is that face rotation was not influential. However, the authors provide no evidence to support this assertion, other than visual inspection (page 15, line 235):
"However, there was no difference in the benefit due to the target face between the upright and inverted condition, and therefore no benefit of the upright face (Figure 2C)."
In fact, the data provided suggest that the opposite may be true, as the improvement for upright faces (t = 6.6) was larger than the improvement for inverted faces (t = 3.9).
An appropriate analysis to test this assertion would be to construct a linear mixed-effects model with fixed factors of face inversion and face matching, and then examine the interaction between these factors.
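For concreteness, the suggested analysis could be sketched roughly as follows. All variable and column names here are hypothetical, and the data are simulated placeholders; the point is only the model structure, in which the interaction term tests whether the face-matching benefit differs between upright and inverted faces.

```python
# Sketch of the suggested mixed-effects analysis. Columns (accuracy,
# inversion, matching, subject) are hypothetical names; the data below
# are random placeholders standing in for the real trial-level data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_trials = 23, 40
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_trials),
    "inversion": rng.choice(["upright", "inverted"], n_subj * n_trials),
    "matching": rng.choice(["target", "distracter"], n_subj * n_trials),
})
df["accuracy"] = rng.normal(0.75, 0.1, len(df))  # placeholder outcome

# Fixed factors of inversion and matching plus their interaction,
# with a random intercept per subject.
model = smf.mixedlm("accuracy ~ inversion * matching", df,
                    groups=df["subject"]).fit()
print(model.summary())
```

The coefficient (and p-value) for the `inversion:matching` interaction term is the direct test of the authors' assertion that the matching benefit does not depend on inversion.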
However, even if this analysis were conducted and the interaction were non-significant, that would not necessarily be strong support for the claim. As the adage has it, "absence of evidence is not evidence of absence". The problem here is that the effect is rather small (a 7% benefit of face matching). Detecting a significant difference due to face inversion within the range of this 7% effect is difficult, but would likely be possible with a larger sample size, assuming that the effect sizes found with the current sample hold (t = 6.6 vs. t = 3.9).
In contrast, in experiment 1, the range is very large (improvement from ~40% for the static face to ~90% for dynamic face) making it much easier to find a significant effect of inversion.
One null model would be to assume that the proportional decrease in accuracy due to inversion is similar for sentence comprehension and pitch modulation (within the face matching effect) and to predict the resulting difference. In Experiment 1, inverting the face at 0 dB reduced accuracy from ~90% to ~80%, a ~10% decrease. Applying this to the 7% effect found in Experiment 2 predicts an inverted-face benefit of ~6.3% vs. 7% for the upright face. The authors could perform a power calculation to determine the sample size necessary to detect an effect of this magnitude.
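A back-of-the-envelope version of this power calculation is sketched below. The standard deviation of the matching benefit is an assumption inferred from the reported statistics (t = 6.6 on a ~7% benefit with n = 23 implies SE ≈ 7/6.6, hence SD ≈ SE × √23); the real within-subject SD of the difference-of-differences may of course differ.

```python
# Rough power calculation for detecting the predicted ~0.7
# percentage-point difference (7% vs ~6.3%). The SD is an assumption
# back-calculated from the reported t = 6.6 with n = 23.
import math
from statsmodels.stats.power import TTestPower

n_current = 23
benefit = 7.0                                  # % benefit of face matching
sd = (benefit / 6.6) * math.sqrt(n_current)    # assumed SD of the benefit
effect = 0.7 / sd                              # Cohen's d for 7% vs 6.3%

n_needed = TTestPower().solve_power(effect_size=effect, alpha=0.05,
                                    power=0.80, alternative="two-sided")
print(f"d = {effect:.2f}, n needed = {math.ceil(n_needed)}")
```

Under these assumptions the implied effect is small, so the required sample would be substantially larger than the current 23 participants.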
Other Comments
When reporting the results of linear mixed-effects models or other regression models, it is important to report the magnitude of each effect, i.e., the actual values of the model coefficients. This allows readers to compare the relative amplitude of different factors on a common scale. For Experiment 1, the only values provided are statistical significance values (p-values), which are not good measures of effect size.
The duration of the pitch modulations in Experiment 2 is not clear. It would help the reader to provide a supplemental figure showing the speech envelopes of the four simultaneous speech streams and the location and duration of the pitch modulations in the target and distracter streams.
If the pitch modulations were brief, it should be possible to calculate reaction time as an additional dependent measure. If the pitch modulations in the target and distracter streams occurred at different times, this would also allow more accurate categorization of the responses as correct or incorrect by creation of a response window. For instance, if a pitch modulation occurred in both streams and the participant responded "yes", then the timing of the pitch modulation and the response could dissociate a false-positive to the distractor stream pitch modulation from the target stream pitch modulation.
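The scoring scheme described above could be as simple as the following sketch. The function, its arguments, and the 0.2–1.2 s response window are all hypothetical illustrations, not details from the manuscript.

```python
# Illustrative response-window scoring. Assumes each trial logs the
# onset times (s) of pitch modulations in the target and distracter
# streams plus the time of a "yes" response. The 0.2-1.2 s window and
# all names are hypothetical.
def classify_response(resp_time, target_onset, distracter_onset,
                      window=(0.2, 1.2)):
    """Attribute a 'yes' response to whichever modulation it follows
    within the response window; otherwise count it as a false alarm."""
    def in_window(onset):
        return (onset is not None
                and window[0] <= resp_time - onset <= window[1])
    if in_window(target_onset):
        return "hit"
    if in_window(distracter_onset):
        return "distracter_false_alarm"
    return "unexplained_false_alarm"

print(classify_response(3.5, 3.0, 1.0))  # response 0.5 s after target onset
print(classify_response(1.8, 3.0, 1.0))  # response 0.8 s after distracter onset
```

Such timing-based attribution would let the authors separate false alarms driven by the distracter stream from unexplained false alarms, as described above.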
It is not clear from the Methods, but it seems that the results shown are only for trials in which a single distracter was presented in the target stream. A standard analysis would be to use signal detection theory to examine response patterns across all of the different conditions.
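The standard signal-detection analysis amounts to computing d' (and, if desired, a criterion) per condition from hit and false-alarm rates; a minimal sketch is below, with invented counts purely for illustration.

```python
# Minimal signal-detection sketch: d' from hit/false-alarm counts
# pooled per condition. The counts in the example call are invented.
from scipy.stats import norm

def dprime(hits, misses, fas, crs):
    """d' with a log-linear correction to avoid infinite z-scores
    when hit or false-alarm rates are 0 or 1."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (fas + 0.5) / (fas + crs + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# e.g. matched-face condition: 80 hits, 20 misses, 15 FAs, 85 CRs
print(round(dprime(80, 20, 15, 85), 2))
```

Comparing d' (sensitivity) and criterion across the matched/mismatched and upright/inverted conditions would show whether the 7% accuracy difference reflects a change in sensitivity or merely a shift in response bias.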
In selective attention experiments, the stimulus is usually identical between conditions while only the task instructions vary. The stimulus and task are both different between experiments 1 and 2, making it difficult to claim that "linguistic" vs. "temporal" is the only difference between the experiments.
At a more conceptual level, it seems problematic to assume that inverting the face dissociates linguistic from temporal processing. For instance, a computer face recognition algorithm whose only job was to measure the timing of mouth movements (temporal processing) might operate by first identifying the face using eye-nose-mouth in vertical order. Inverting the face would disrupt the algorithm, and hence "temporal processing", invalidating the assumption that face inversion is a pure manipulation of "linguistic processing".