On 2016 Oct 22, Lydia Maniatis commented:
Part 1
This paper is all too similar to a large proportion of the vision literature, in which fussy computations thinly veil a hollow theoretical core, composed of indefensible hypotheses asserted as fact (and thus implicitly requiring no justification), sometimes supported by citations that only weakly support them, if at all. The casual yet effective (from a publication point of view) fashion in which many authors assert popular (even if long-debunked) fallacies and conjure up other pretexts for what are, in fact, mere measurements without actual or potential theoretical value is well on display here.
What is surprising, perhaps in every case, is the willful empirical agnosia and lack of common sense, at every level - general purpose, method, data analysis - necessary to enable such studies to be conducted and published. A superficial computational complexity adds insult to injury, as many readers may wrongly feel they are not competent to understand and evaluate the validity of a study whose terms and procedures are so layered, opaque and jargony. However, the math is a distraction.
Unjustified and/or empirically false assumptions and procedures occur, as mentioned, at every level. I discuss some of the more serious ones below (this is the first of a series of comments on this paper).
- Misleading, theoretically and practically untenable, definitions of “3D tilt” (and other variables).
The terms slant and tilt naturally refer to a geometrical characteristic of a physical plane or volume (relative to a reference plane). The first sentence of Burge et al’s abstract gives the impression that we are talking about tilt of surfaces: “Estimating 3D surface orientation (slant and tilt) is an important first step toward estimating 3D shape. Here, we examine how three local image cues …should be combined to estimate 3D tilt in natural scenes.” As it turns out, the authors perform a semantic but theoretically pregnant sleight of hand in the switch from the phrase “3D surface orientation (slant and tilt)” to the phrase “3D tilt” (which is also used in the title).
The obvious inference from the context is that the latter is a mere shorthand for the former. But it is not. In fact, as the authors finally reveal on p. 3 of their introduction, their procedure for estimating what they call “3D tilt” does not allow them to correlate their results with the tilt of surfaces: “Our analysis does not distinguish between the tilt of surfaces belonging to individual objects and the tilt (i.e. orientation [which earlier was equated with “slant and tilt”]) of depth discontinuities…We therefore emphasize that our analysis is best thought of as 3D tilt rather than 3D surface tilt estimation.”
“3D tilt” is, in effect, a conceptually incoherent term made up to coincide with the (unrationalised) procedure used to arrive at certain measures given this label. I find the description of the procedure opaque, but as I am able to understand it, small patches of images are selected, and processed to produce “3D tilt” values based on range values collected by a range finder within that region of space. The readings within the region can be from one, two, three, four, or any number of different surfaces or objects; the method does not discriminate among these cases. In other words, these local “3D tilt values” have no necessary relationship to tilt of surfaces (let alone tilt of objects, which is more relevant (to be discussed) and which the authors don’t address even nominally). We are talking about a paradoxically abstract, disembodied definition of “3D tilt.”
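To make the objection concrete: on a conventional reading (the paper's exact procedure is described only loosely in the passages quoted here), "tilt" at a location is the image-plane direction of the local depth gradient, computed from the range values in a small patch. A minimal sketch of such a patchwise computation, with the plane-fit and angle conventions being my assumptions rather than the authors', illustrates why the measure is indifferent to surface boundaries:

```python
import numpy as np

def local_tilt_deg(range_patch):
    """Estimate "3D tilt" for a small range-map patch as the image-plane
    direction of the depth gradient (a conventional definition; the
    paper's actual procedure may differ).

    Fits a plane z = a*x + b*y + c to the range values by least squares;
    tilt is the orientation of the gradient (a, b), in degrees."""
    h, w = range_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    (a, b, _), *_ = np.linalg.lstsq(A, range_patch.ravel(), rcond=None)
    return np.degrees(np.arctan2(b, a)) % 360.0
```

Note that this computation happily returns a "tilt" whether the patch covers one surface, two, or a depth discontinuity between unrelated objects - which is precisely the point of the criticism above.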
As a reader, being asked to “think” of the measurements as representing “3D tilt” rather than “3D surface tilt” doesn’t help me understand how this term relates, in any useful or principled way, either to the actual physical structure of the world or to the visual process that represents this world. The idea that measuring this kind of “tilt” could be useful for forming a representation of the physical environment, and that the visual system might have evolved a way to estimate these intrinsically random and incidental values, seems invalid on its face - and the authors make no case for it.
They then proceed to measure three other home-cooked variables, in order to search for possible correlations between these and “3D tilt.” These variables are also chosen arbitrarily, i.e. in the absence of a theoretical rationale, based on: “simplicity, historical precedence, and plausibility given known processing in the early visual system” (p. 2). Simplicity is not, by itself, a rationale – it has to have a rational basis. At first glance, at least the third of these reasons would seem to constitute a shadow of a theoretical rationale, but it is based on sparse, premature and over-interpreted physiological data, primarily of V1 neuron activity. Furthermore, the authors’ definitions of their three putative cues – disparity gradient, luminance gradient, texture gradient – are very particular, assumption-laden, paradoxical, and unrationalised.
For example, the measure of “texture orientation” involves the assumption that textures are generally composed of “isotropic [i.e. circular] elements” (p. 8). This assumption is unwarranted to begin with. Given, furthermore, that the authors’ measures at no point involve parsing the “locations” measured into figures and grounds, it is difficult to understand what they can mean by the term “texture element.” Like tilt, reference to an “isotropic texture element” implies a bounded, discrete area of space with certain geometric characteristics and relationships. It makes no sense to apply it to an arbitrary set of pixel luminances.
Also, as in the case of “3D tilt,” the definition of “texture gradient” is both arbitrary and superficially complex: “we define [the dominant orientation of the image texture] in the Fourier domain. First, we subtract the mean luminance and multiply by (window with) the Gaussian kernel above centered on (x, y). We then take the Fourier transform of the windowed image and compute the amplitude spectrum. Finally, we use singular value decomposition…” One, two, three… but WHY did you make these choices? Simplicity, historical precedence, Hubel and Wiesel…?
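For readers who want to see what the quoted recipe amounts to, it can be paraphrased in code. Everything below is a reconstruction from the quoted sentences: the window size, the use of a second-moment matrix (one plausible reading of the “singular value decomposition” step), and the orientation convention are my assumptions, not the authors’:

```python
import numpy as np

def dominant_texture_orientation_deg(patch, sigma=8.0):
    """Dominant image-texture orientation following the quoted recipe:
    subtract mean luminance, window with a Gaussian, take the Fourier
    amplitude spectrum, then extract the principal axis of that spectrum
    (here via the amplitude-weighted second-moment matrix of the
    frequency coordinates - one plausible reading of the SVD step)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    window = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    windowed = (patch - patch.mean()) * window
    amp = np.abs(np.fft.fftshift(np.fft.fft2(windowed)))
    # Frequency coordinates (cycles/pixel), shifted so DC is at center
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    # Amplitude-weighted second-moment matrix of the frequency coordinates
    m = np.array([[np.sum(amp * fx * fx), np.sum(amp * fx * fy)],
                  [np.sum(amp * fx * fy), np.sum(amp * fy * fy)]])
    vals, vecs = np.linalg.eigh(m)
    vx, vy = vecs[:, np.argmax(vals)]  # dominant frequency axis
    # Texture orientation is perpendicular to the dominant frequency axis
    return (np.degrees(np.arctan2(vy, vx)) + 90.0) % 180.0
```

Written out this way, the arbitrariness is easier to see: each step (the windowing, the move to the amplitude spectrum, the reduction to a single principal axis) is a design choice, and nothing in the quoted passage explains why these choices, rather than others, should track anything the visual system computes.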
If, serendipitously, the authors’ choices of things to measure and compare had led to high correlations, they might have been justified in sharing them. But as it turns out, not surprisingly, the correlations between “cues” and “tilt” are “typically not very accurate.” Certain (unpredicted) particularities of the data to which the authors speculatively attribute theoretical value (incidentally undermining one of their major premises) will be discussed later.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.