On 2016 May 26, Lydia Maniatis commented:
Part 1: This article is nonsense, a quality that unfortunately doesn't differentiate it from much (likely most) of the vision science literature today. It's particularly disconcerting that the authors are employing animals, and in particular, monkeys.
The authors are making a fundamental error in describing the visual process (see below). The error predates the period of German research known as Gestalt psychology. The latter generated cogent logical arguments and empirical evidence against this erroneous, myopic description. The modern proponents of this view are perpetually spinning their wheels, using arbitrary experimental situations to generate inconclusive data fed into ad hoc “models” built on arbitrary assumptions, perpetually adjusting these models, and treating failures and contradictions as temporary bumps in the road. This literature has become an impenetrable swamp of detail unguided by careful observations and lacking organizing concepts.
Since analyzing perception at a neural level involves correlations, there is obviously no point in attempting it unless we have correctly described what it is that we are trying to correlate neural activity with. The authors are naive about the basic challenges that face the visual system, and so are their descriptions of “stimuli.” They describe the “stimulus” as a face, a shoe, a chair, containing curves, straight lines, etc. Particular neurons or neural populations are supposed to be particularly sensitive to – i.e., to “detect” – particular features thus naively described. We're told, for example, that “Many V4 neurons” are “selectively sensitive to curvature extrema” located at “a particular position relative to the center of a closed, curved object.” Leaving aside that “a particular position relative to the center of a closed, curved object” conveys no information whatsoever, there are serious theoretical and practical problems with this discussion.
First, it is a serious mistake to treat perception as a detection and information-reduction problem. The authors frame the problem as: here is a “face;” how does the visual system “detect” it, and how much of the information contained in the “face” is thrown away?
The problem is that the stimulus with which the visual system is confronted, and which it must use to create a percept, is never a “face,” a “shoe,” or a “chair.” It is a disconnected set of points, which, moreover, are constantly changing location on the constantly moving retinal surface. Being a sum of disconnected points (the stimulating photons), the stimulus does not intrinsically contain any type of shape or shape feature, such as lines or curves, and it certainly doesn't contain closed, curved objects. If it did, then the problem of computer vision would not be so difficult: we would just tell the computer to pick out (to match) the relevant shape features. The problem is that no computer has yet been made that can infer/construct the absent forms in the first place. That's why reading the numbers or letters in captchas, a trivial problem for us, is so difficult for robots.
Any shapes eventually perceived are the result of what are, in effect, inferential processes. Whether, and to what degree, the percept will resemble the physical object from which the photon pattern originated (which, again, the authors naively equate with the stimulus) depends on many factors. The percept might resemble it, and it might not. If an actual face, for example, is painted the same color as its background and everything is equally illuminated from all directions, then we will see a single homogeneous color instead of a face. Conversely, a flat surface may appear to possess 3D physical structure.
These simple examples are enough to show the problem, which is that photons reflected from objects don't possess the properties of these objects, shape or otherwise. Obviously, variations in intensity and wavelength of the photons are required in order for there to be a chance that the percept will resemble, in experienced structure and properties, the physical object. But this is not enough. Even luminance boundaries can't be said to intrinsically possess shape properties; we need to solve the figure-ground problem first, and the illumination/reflectance problem. And as the Gestaltists showed conclusively, non-local points influence what forms will be seen locally, whether or not these forms correspond to actual intensity changes. How these forms are constructed is a non-trivial problem that must be solved before we can refer to “a curve” or “a face.” Wilson and Wilkinson don't see these problems because they describe stimuli in terms of what they see – in terms of the problems already solved by their respective visual systems. (Again, if there were a simple correlation between what they see and the photon information, even in a simple geometrical or mathematical sense, then computers would be better at seeing by now.)
Therefore, to say that a neuron in V4 simply “detects” a particular feature that lies within its particular receptive field, and feeds this information to higher levels for combination (a summary process) with information from other such neural detectors, flies in the face of perceptual facts (and of exhaustive arguments, available in the Gestalt literature, as to why these claims can't hold in principle). This is even more so given that we know that the activity of visual neurons even below V4 – neurons which “detect,” for example, illusory contours that are not physically present – is determined on the basis of the entire visual field (and of the principles by which this field is organized).
Furthermore, as Teller (1984) pointed out, the logic behind the assumption that a higher level of firing in a particular neuron indicates a special role in “encoding” the “feature” that is supposedly being responded to is faulty. Perceptual outcomes depend on the relative activity of neurons, not on the peak activity of individual neurons. As an example, she refers to cone responses and their role in coding color perception. What matters are relative, not maximum, firing rates.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.