A recent extension [8] fuses the spatial and flow streams after the last network convolutional layer, showing some improvement on HMDB while requiring less test time augmentation (snapshot sampling). Our implementation follows this paper approximately using Inception-V1. The inputs to the network are 5 consecutive RGB frames sampled 10 frames apart, as well as the corresponding optical flow snippets. The spatial and motion features before the last average pooling layer of Inception-V1 (5 × 7 × 7 feature grids, corresponding to time, x and y dimensions) are passed through a 3 × 3 × 3 3D convolutional layer with 512 output channels, followed by a 3 × 3 × 3 3D max-pooling layer and through a final fully connected layer. The weights of these new layers are initialized with Gaussian noise.
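To make the fusion head concrete, the sketch below shows one possible PyTorch rendering of the layers described above. It is not the original implementation: the class name `FusionHead`, the per-stream channel count of 1024 (Inception-V1's final feature depth), the concatenation-based fusion of the two streams, the convolution padding, the ReLU, and the global average pooling before the fully connected layer are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Hypothetical sketch of the 3D fusion head described in the text.

    Assumes the spatial (RGB) and motion (flow) Inception-V1 feature grids
    are pre-extracted, each of shape (batch, 1024, 5, 7, 7) for
    (channels, time, y, x). Fusion by concatenation and the final pooling
    are assumptions, not details taken from [8].
    """

    def __init__(self, in_channels=2 * 1024, num_classes=400):
        super().__init__()
        # 3x3x3 3D convolution with 512 output channels (padding is an assumption)
        self.conv3d = nn.Conv3d(in_channels, 512, kernel_size=3, padding=1)
        # 3x3x3 3D max pooling (stride is an assumption)
        self.pool3d = nn.MaxPool3d(kernel_size=3, stride=2)
        # final fully connected classification layer
        self.fc = nn.Linear(512, num_classes)
        # new layers initialized with Gaussian noise
        for m in (self.conv3d, self.fc):
            nn.init.normal_(m.weight, std=0.01)
            nn.init.zeros_(m.bias)

    def forward(self, rgb_feats, flow_feats):
        # concatenate the two streams along the channel axis (assumption)
        x = torch.cat([rgb_feats, flow_feats], dim=1)  # (B, 2048, 5, 7, 7)
        x = torch.relu(self.conv3d(x))                 # (B, 512, 5, 7, 7)
        x = self.pool3d(x)                             # (B, 512, 2, 3, 3)
        x = x.mean(dim=[2, 3, 4])                      # collapse remaining grid (assumption)
        return self.fc(x)                              # (B, num_classes)
```

Under these assumptions, a forward pass with two dummy feature tensors of shape (2, 1024, 5, 7, 7) would yield a (2, num_classes) tensor of class logits.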