Reviewer #2 (Public Review):  
Luongo et al. investigated the behavioural ability of four species (macaque, mouse lemur, tree shrew and mouse) to segment figures from the background when the figures are defined by opponent motion or by other visual features. With carefully designed experiments, they convincingly make the point that figures defined not by textural elements (orientation or phase offsets, and thus visible in a still frame) but purely by motion contrast could not be detected by the non-primate species. Interestingly, the difficulty appears to be specific to motion contrast, since pure motion (figures moving on a static background) could be discriminated better, at least by mice.  
This is highly interesting and surprising -- especially for a tree shrew, a diurnal, arboreal mammal, very closely related to primates and with a highly evolved visual system. It is also an important difference to take into account considering the multitude of studies on the mouse visual system in recent years.  
The authors additionally present neuronal activity in mice, from three different visual cortical areas recorded with both electrophysiology and imaging. Their conclusions are mostly supported by the data, but some aspects of the recordings and data analysis need to be clarified and extended.  
The main issues are outlined below roughly in order of importance:  
1. The most worrying aspect is that, if I interpret their figures correctly, their recordings do not seem very stable, and this may account for many of the differences across the visual conditions. The authors do not report the order in which the different stimuli were shown. Their supplemental movie, however, makes it seem as though the conditions were not fully interleaved, but were potentially recorded in a block design, with all cross1 positions recorded first, before switching to cross2 positions and then on to iso... If I interpret Figure 6a correctly, each line is the same neuron and the gray scale shows the average response rate for each condition. Many of these neurons, however, show a large change in activity between the cross1 and the cross2 block, much larger than the variability within each block, which should be due to figure location and orientation tuning. If this interpretation is correct, it would mean that either there were significant brain state changes between the blocks (the mice were on a ball, but the authors do not report whether and how much the animals were moving) or the recordings were unstable over time. It would be good to know whether similarly dramatic changes in overall activity level also occur between the blocks in their imaging data.  
The same might be true for the differences in the maps between conditions in Figure 4. If the recordings were indeed made in blocks and some cells stopped responding, this could explain the low map similarities. For example, Cell 1 seems to be a simple ON cell for the cross stimuli, almost like their idealized cell in Figure 3d. However, even though the texture in the RF and in large parts of the surround is exactly identical for Cross1 and Iso2, as well as for Cross2 and Iso1, at many of the figure locations, the cell's responses for both iso conditions appear to be only noise, or at least extremely noise dominated. Why would the cell not respond in a phase- or luminance-dependent manner here?  
This could be due either to very high surround suppression in the iso condition (which cannot be judged given the within-condition normalization) or to the cell simply responding much more weakly because of recording instability or brain state changes. Without evidence of significant visual responses, enough spikes in each condition and a stable recording across all blocks, these data are not really interpretable. Instability or generally lower firing rates could easily also explain the differences in their decoding accuracy.  
Similarly, it is very hard to judge the quality of their imaging data. They show no example fields of view or calcium response traces and never directly compare these data to their electrophysiology data. It is mentioned that the imaging data are noisy and qualitatively similar, but some quantification could help convince the reader. Even if the data are noisy, it is puzzling that the decoding accuracy should be so much worse with the imaging data: even with ten times more neurons included, accuracy still does not reach even 30% of that of the ephys data. This could point to very poor data quality.  
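One simple way to quantify the stability concern raised in point 1 would be to compare, for each unit, the spread of the block-averaged firing rates with the typical spread across figure positions within a block. The Python sketch below is only an illustration under stated assumptions: the function name `block_drift_ratio`, the block labels and all firing rates are hypothetical and are not taken from the manuscript.

```python
from statistics import mean, pstdev

def block_drift_ratio(rates_by_block):
    """Ratio of between-block to within-block variability for one unit.

    rates_by_block: dict mapping a block label (e.g. "cross1") to a list of
    mean firing rates, one per figure position shown in that block.
    """
    # Spread of the block averages: large if the overall rate drifts between blocks.
    block_means = [mean(rates) for rates in rates_by_block.values()]
    between = pstdev(block_means)
    # Typical spread across figure positions inside a single block.
    within = mean(pstdev(rates) for rates in rates_by_block.values())
    return between / within if within > 0 else float("inf")

# Hypothetical unit (made-up numbers): modest position tuning, but the
# overall rate drops sharply between the cross1 and cross2 blocks.
unit = {
    "cross1": [10.0, 12.0, 11.0, 9.0],
    "cross2": [3.0, 4.0, 2.0, 3.0],
}
ratio = block_drift_ratio(unit)
print(f"between/within variability ratio: {ratio:.2f}")
```

A ratio well above 1 would indicate that the change between blocks exceeds what figure location and orientation tuning produce within a block. Reporting such a measure for both the ephys and the imaging data, together with running speed, would help to separate instability from genuine stimulus effects.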
2. No information is given on the recorded units. Were they spike sorted? Did the authors try to distinguish fast-spiking and regular-spiking units? What layers were they recorded from? It is well known that there are large laminar differences in the strength of figure-ground modulation, as well as in orientation-tuned surround suppression. If most of their data were from layer 5, a lack of clear figure modulation might not be that surprising. This could perhaps also be seen when comparing their electrophysiology data to the imaging data, which are reportedly from layer 2/3, where most neurons show larger figure modulation and tuned surround suppression effects. There is, however, no report or discussion of differences in modulation between recording modalities.  
3. There is an apparent discrepancy between Figure 5d and Figure 5i. How can their modulation index be around -0.1 for cross (Figure 5d), which would correspond to responses that are on average ~20% weaker to a figure than to background, when their PSTH (Figure 5i) shows an almost 50% increase of figure over ground? This positive figure modulation has also been widely reported in the literature (Schnabel, Kirchberger, Keller). Are there different populations of cells going into these analyses?  
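The arithmetic behind point 3 can be made explicit. The sketch below assumes the conventional definition of the modulation index, MI = (F - G) / (F + G), where F and G are the responses to figure and ground. This definition is an assumption on my part and should be checked against the one the authors actually used; the function names are hypothetical.

```python
def modulation_index(figure, ground):
    """Assumed definition: MI = (F - G) / (F + G)."""
    return (figure - ground) / (figure + ground)

def figure_to_ground_ratio(mi):
    """Invert the assumed definition: F / G = (1 + MI) / (1 - MI)."""
    return (1 + mi) / (1 - mi)

# An MI of -0.1 implies a figure response of about 0.82 times the ground response.
ratio = figure_to_ground_ratio(-0.1)

# A 50% increase of figure over ground (F = 1.5 * G) implies an MI of +0.2.
mi_from_psth = modulation_index(1.5, 1.0)

print(f"F/G at MI = -0.1: {ratio:.3f}")
print(f"MI for a 50% figure enhancement: {mi_from_psth:.2f}")
```

Under this assumed definition, the two panels differ not only in magnitude but in sign (roughly -0.1 versus +0.2), which is why it matters whether the same cells enter both analyses.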
4. In a similar vein, it is not immediately clear why the average map correlation would be bigger for random cell pairs (~0.2, Fig 3g) than for the different conditions of the same cell (~0, Fig 5b). Could this be due to differences in recording modality (imaging in 3g and ephys in 5b)?  
5. The maps in Figure 4 should show the location of the RF, because they cannot be interpreted without knowledge of the RF center and size. For example, cell 4 in the iso 1 condition could be a border cell, or it could respond to the center of the figure. It is impossible to tell which without knowing the location of the RF.  
6. It could help the reader to discuss the interpretation of the map correlations in Fig 5 a and b in more detail. My guess is that negatively correlated maps (within the cross or iso condition) could come from highly orientation-tuned neurons, whereas higher correlation values point to cells that are more generally figure or contextually modulated (within this condition). While the distribution is far from bimodal, this does not rule out a population of nicely figure-modulated cells at the high end of the distribution. It might not be necessary at the level of V1 that the figure modulation be consistent across all textures. It would not be surprising if orientation contrast-defined, phase contrast-defined and motion contrast-defined figures were signalled to higher areas by discrete populations of V1 or even LM cells.  
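To illustrate the guess in point 6, here is a toy Python sketch with made-up one-dimensional "maps" (response as a function of figure position, with the RF at the middle position). It assumes that the figure and background orientations are swapped between cross1 and cross2; all response values and variable names are hypothetical and serve only to show the expected sign of the map correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long response maps."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Orientation-tuned cell (hypothetical rates): it fires when its preferred
# orientation covers the RF. In cross1 that is the figure, in cross2 the
# background, so the two maps are mirror images of each other.
tuned_cross1 = [1.0, 1.0, 5.0, 1.0, 1.0]
tuned_cross2 = [5.0, 5.0, 1.0, 5.0, 5.0]

# Figure-modulated cell (hypothetical rates): it fires whenever the figure
# covers the RF, whichever orientation the figure carries.
fig_cross1 = [1.0, 1.0, 5.0, 1.0, 1.0]
fig_cross2 = [1.0, 2.0, 4.0, 1.0, 2.0]

r_tuned = pearson(tuned_cross1, tuned_cross2)
r_fig = pearson(fig_cross1, fig_cross2)
print(f"orientation-tuned cell: r = {r_tuned:.2f}")
print(f"figure-modulated cell:  r = {r_fig:.2f}")
```

In this toy case the orientation-tuned cell yields a strongly negative map correlation and the figure-modulated cell a strongly positive one, so a broad unimodal distribution centred near zero could still contain both kinds of cells.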
7. Some of the behavioural results warrant a little more explanation or discussion as well. In Figure 2h, the mice seem considerably better on the static version of the iso task than on the moving one. If this difference is statistically significant, it should be discussed. Is this because the static frame was maximally phase offset? In that case the figure would indeed be more visible (larger phase contrast in more frames) than in the moving condition.  
Figure 2 and extended Figure 1c: why is the mouse lemur performing so poorly on average? It also appears to have the biggest problems with the cross stimulus early on in training.  
Tree shrews seem not to be able to memorize the textures as well as the mice do. Is this because of less deprivation/motivation? Or because of the bigger set of textures in training? This would make memorization harder and could thus lower their overall performance. The comparative aspects are very interesting but the absolute differences in performance could be discussed in more detail or explained better.  
8. In Figure 7b, why wouldn't the explanation for the linear decodability in cross also hold for iso? There are phase offsets at the borders that simple cells should readily be able to resolve, just as in the case of orientation discontinuities. Could they make a surround phase model, similar to their surround orientation model, that could more readily capture the iso discontinuities?