34 Matching Annotations
  1. May 2021
    1. The second spatiotemporal variant is a “(2+1)D” convolutional block, which explicitly factorizes 3D convolution into two separate and successive operations, a 2D spatial convolution and a 1D temporal convolution.

      More nonlinearities and an easier optimization task
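
A quick arithmetic check of the parameter-matching idea from the R(2+1)D paper: the number of intermediate channels M is chosen so that the 2D-spatial plus 1D-temporal pair costs about the same as the full 3D convolution. A minimal sketch; the helper names and example channel counts are mine:

```python
# Parameter counts for a full t x d x d 3D convolution vs. its (2+1)D
# factorization with a parameter-matched intermediate width M.
def conv3d_params(c_in, c_out, t, d):
    return c_in * c_out * t * d * d

def conv2plus1d_params(c_in, c_out, t, d):
    # M = floor(t*d^2*c_in*c_out / (d^2*c_in + t*c_out)), as chosen in R(2+1)D
    m = (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)
    spatial = c_in * m * d * d   # 1 x d x d spatial convolution
    temporal = m * c_out * t     # t x 1 x 1 temporal convolution
    return spatial + temporal, m

full = conv3d_params(64, 64, 3, 3)
factored, m = conv2plus1d_params(64, 64, 3, 3)
print(full, factored, m)  # 110592 110592 144
```

With these (illustrative) sizes the match is exact, and the block gains an extra nonlinearity between the two convolutions at no parameter cost.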

  2. Mar 2021
    1. Regardless of the size of output produced by the last convolutional layer, each network applies global spatiotemporal average pooling to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification (the output dimension of the fc layer matches the number of classes, e.g., 400 for Kinetics).
    2. The first formulation is named mixed convolution (MC) and consists in employing 3D convolutions only in the early layers of the network, with 2D convolutions in the top layers.
  3. Feb 2021
    1. A recent extension [8] fuses the spatial and flow streams after the last network convolutional layer, showing some improvement on HMDB while requiring less test time augmentation (snapshot sampling). Our implementation follows this paper approximately using Inception-V1. The inputs to the network are 5 consecutive RGB frames sampled 10 frames apart, as well as the corresponding optical flow snippets. The spatial and motion features before the last average pooling layer of Inception-V1 (5×7×7 feature grids, corresponding to time, x and y dimensions) are passed through a 3×3×3 3D convolutional layer with 512 output channels, followed by a 3×3×3 3D max-pooling layer and through a final fully connected layer. The weights of these new layers are initialized with Gaussian noise.

      Two-Stream Networks

    2. For this paper we implemented a small variation of C3D [31], which has 8 convolutional layers, 5 pooling layers and 2 fully connected layers at the top. The inputs to the model are short 16-frame clips with 112×112-pixel crops as in the original implementation. Differently from [31] we used batch normalization after all convolutional and fully connected layers. Another difference to the original model is in the first pooling layer, we use a temporal stride of 2 instead of 1, which reduces the memory footprint and allows for bigger batches – this was important for batch normalization (especially after the fully connected layers, where there is no weight tying). Using this stride we were able to train with 15 videos per batch per GPU using standard K40 GPUs.

      C3D

    3. The model is trained using cross-entropy losses on the outputs at all time steps. During testing we consider only the output on the last frame. Input video frames are subsampled by keeping one out of every 5, from an original 25 frames-per-second stream. The full temporal footprint of all models is given in table 1.

      ConvNet+ LSTM

    4. In this paper we compare and study a subset of models that span most of this space. Among 2D ConvNet methods, we consider ConvNets with LSTMs on top [5, 37], and two-stream networks with two different types of stream fusion [8, 27]. We also consider a 3D ConvNet [14, 30]: C3D.

      comparison

    5. The model, termed a “Two-Stream Inflated 3D ConvNet” (I3D), builds upon state-of-the-art image classification architectures, but inflates their filters and pooling kernels (and optionally their parameters) into 3D, leading to very deep, naturally spatio-temporal classifiers. An I3D model based on Inception-v1 [13] obtains performance far exceeding the state-of-the-art, after pre-training on Kinetics.

      concept
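
The inflation trick itself is simple to sketch: repeat the 2D kernel t times along the temporal axis and rescale by 1/t, so that on a static (“boring”) video the inflated filter reproduces the 2D network’s activations. A minimal sketch; the function name and example kernel are mine:

```python
# Inflate a k x k 2D filter into a t x k x k 3D filter (I3D-style):
# replicate along time and divide by t to preserve activations on
# temporally constant input.
def inflate_2d_filter(kernel_2d, t):
    scale = 1.0 / t
    return [[[w * scale for w in row] for row in kernel_2d] for _ in range(t)]

k2d = [[1.0, 2.0], [3.0, 4.0]]
k3d = inflate_2d_filter(k2d, 4)

# Sanity check: summing the inflated filter over time recovers the 2D filter,
# which is exactly what a static video "sees".
summed = [[sum(k3d[ti][r][c] for ti in range(4)) for c in range(2)]
          for r in range(2)]
print(summed)  # [[1.0, 2.0], [3.0, 4.0]]
```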

    6. Our experimental strategy is to reimplement a number of representative neural network architectures from the literature, and then analyze their transfer behavior by first pre-training each one on Kinetics and then fine-tuning each on HMDB-51 and UCF-101.

      Initial goal of the experiment

    1. we train and evaluate models with clips of 8 frames (T = 8) by skipping every other frame (all videos are pre-processed to 30fps, so the newly-formed clips are effectively at 15fps)

      augmentation
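
The sampling scheme in the quote reduces to simple index arithmetic; a small sketch (helper name mine) showing the effective frame rate and the temporal footprint of one clip:

```python
# Frame indices for an 8-frame clip that skips every other frame of a
# 30 fps video, giving an effective 15 fps clip.
def clip_indices(start, num_frames=8, stride=2):
    return [start + i * stride for i in range(num_frames)]

idx = clip_indices(start=0)
print(idx)  # [0, 2, 4, 6, 8, 10, 12, 14]

effective_fps = 30 / 2                               # 15.0
temporal_footprint_s = (idx[-1] - idx[0] + 1) / 30   # seconds of video covered
```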

    2. Base architecture. We use ResNet3D, presented in Table 1, as our base architecture for most of our ablation experiments in this section. More specifically, our model takes clips with a size of T×224×224 where T = 8 is the number of frames, 224 is the height and width of the cropped frame. Two spatial downsampling layers (1×2×2) are applied at conv1 and at pool1, and three spatiotemporal downsampling layers (2×2×2) are applied at conv3_1, conv4_1 and conv5_1 via convolutional striding. A global spatiotemporal average pooling with kernel size T/8×7×7 is applied to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification.

      260K videos
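
The downsampling schedule above can be verified arithmetically: starting from T×224×224 with T = 8, the stated strides leave a (T/8)×7×7 = 1×7×7 tensor for the global pooling. A small sketch, using the layer names from the quote:

```python
# Propagate the clip shape (T x H x W) through ResNet3D's downsampling layers.
def downsample(shape, stride):
    return tuple(s // st for s, st in zip(shape, stride))

shape = (8, 224, 224)  # T = 8 frames, 224 x 224 crops
for name, stride in [("conv1", (1, 2, 2)), ("pool1", (1, 2, 2)),
                     ("conv3_1", (2, 2, 2)), ("conv4_1", (2, 2, 2)),
                     ("conv5_1", (2, 2, 2))]:
    shape = downsample(shape, stride)

print(shape)  # (1, 7, 7) -> matches the T/8 x 7 x 7 global pooling kernel
```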

    3. Interaction-reduced channel-separated bottleneck block is derived from the preserved bottleneck block by removing the extra 1×1×1 convolution. This yields the depthwise bottleneck block shown in Figure 2(c). Note that the initial and final 1×1×1 convolutions (usually interpreted respectively as projecting into a lower-dimensional subspace and then projecting back to the original dimensionality) are now the only mechanism left for channel interactions. This implies that the complete block shown in (c) has a reduced number of channel interactions compared with those shown in (a) or (b). We call this design an interaction-reduced channel-separated bottleneck block and the resulting architecture an interaction-reduced channel-separated network (ir-CSN).

      interaction-reduced channel-separated block

    4. Interaction-preserved channel-separated bottleneck block is obtained from the standard bottleneck block (Figure 2(a)) by replacing the 3×3×3 convolution in (a) with a 1×1×1 traditional convolution and a 3×3×3 depthwise convolution (shown in Figure 2(b)). This block reduces parameters and FLOPs of the traditional 3×3×3 convolution significantly, but preserves all channel interactions via a newly-added 1×1×1 convolution. We call this an interaction-preserved channel-separated bottleneck block and the resulting architecture an interaction-preserved channel-separated network (ip-CSN).

      interaction-preserved channel-separated network
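
A back-of-the-envelope parameter comparison of the middle 3×3×3 stage in the three block variants, with c channels in and out; the bottleneck’s own 1×1×1 projection layers, which all variants share, are ignored. Helper names and the channel count are mine:

```python
# Weight counts (biases omitted) for the middle stage of a bottleneck block.
def standard(c):   # dense 3x3x3 convolution: every filter sees all channels
    return c * c * 27

def ip_csn(c):     # 1x1x1 conv (keeps all channel interactions) + 3x3x3 depthwise
    return c * c + c * 27

def ir_csn(c):     # 3x3x3 depthwise only; channel interactions are left to
    return c * 27  # the 1x1x1 projections at the block's ends

c = 256
print(standard(c), ip_csn(c), ir_csn(c))  # 1769472 72448 6912
```

The ordering standard > ip-CSN > ir-CSN mirrors the text: ip-CSN pays a little extra to preserve channel interactions, ir-CSN strips them from this stage entirely.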

    5. These reductions occur because each filter in a group receives input from only a fraction 1/G of the channels from the previous layer. In other words, channel grouping restricts feature interaction: only channels within a group can interact.

      reductions by grouping
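
The 1/G reduction is easy to verify by counting weights; depthwise convolution is the extreme case where G equals the number of channels. A sketch with illustrative sizes:

```python
# Grouped 3D convolution: each filter sees only c_in / G input channels,
# so the parameter (and FLOP) count drops by a factor of G vs. dense conv.
def grouped_conv3d_params(c_in, c_out, t, d, groups=1):
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * t * d * d

dense = grouped_conv3d_params(64, 64, 3, 3, groups=1)
g4 = grouped_conv3d_params(64, 64, 3, 3, groups=4)
depthwise = grouped_conv3d_params(64, 64, 3, 3, groups=64)
print(dense, g4, depthwise)  # 110592 27648 1728
```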

    6. Conventional convolution is implemented with dense connections, i.e., each convolutional filter receives input from all channels of its previous layer, as in Figure 1(a). However, in order to reduce the computational cost and model size, these connections can be sparsified by grouping convolutional filters into subsets.

      conventional convolution

  4. Jan 2021
    1. ARTNet [34] decouples spatial and temporal modeling into two parallel branches. Similarly, 3D convolutions can also be decomposed into a Pseudo-3D convolutional block as in P3D [25] or factorized convolutions as in R(2+1)D [32] or S3D [40]. 3D group convolution was also applied to video classification in ResNeXt [16] and Multi-Fiber Networks [5] (MFNet).

      decomposition of model

    2. P3D [25], R(2+1)D [32], and S3D [40]. In these architectures, a 3D convolution is replaced with a 2D convolution (in space) followed by a 1D convolution (in time). This factorization can be leveraged to increase accuracy and/or to reduce computation.

      3D convolution architectures

    1. A typical value of τ we studied is 16—this refreshing speed is roughly 2 frames sampled per second for 30-fps videos.

      At what frequency is the data sampled for Slow pathway
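
A sketch of the two pathways' frame sampling from one raw clip. The quote only fixes τ = 16; the 64-frame clip length and the speed ratio α = 8 below are assumptions taken from the paper's typical SlowFast configuration:

```python
# Sample frame indices for the Slow (stride tau) and Fast (stride tau/alpha)
# pathways from the same raw clip.
def pathway_indices(clip_len, stride):
    return list(range(0, clip_len, stride))

tau, alpha, clip_len = 16, 8, 64                # alpha and clip_len assumed
slow = pathway_indices(clip_len, tau)           # sparse: ~2 fps at 30 fps input
fast = pathway_indices(clip_len, tau // alpha)  # alpha x denser in time
print(len(slow), len(fast))  # 4 32
```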

    2. This method has been a foundation of many competitive results in the literature [12, 13, 55].

      Reference to v1b. Unlike the network presented in v1b, this method does not require a separate optical-flow preprocessing step.

    3. One pathway is designed to capture semantic information that can be given by images or a few sparse frames, and it operates at low frame rates and slow refreshing speed. In contrast, the other pathway is responsible for capturing rapidly changing motion, by operating at fast refreshing speed and high temporal resolution. Despite its high temporal rate, this pathway is made very lightweight, e.g., ∼20% of total computation. This is because this pathway is designed to have fewer channels and weaker ability to process spatial information, while such information can be provided by the first pathway in a less redundant manner.

      Difference between pathways and computational complexity as % of total.

    4. (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution.

      Motivation.

    1. For the extraction of optical flow and warped optical flow, we choose the TVL1 optical flow algorithm [35] implemented in OpenCV with CUDA.

      Optical flow algorithm used. This one is required for the temporal CNN.

    2. We use the mini-batch stochastic gradient descent algorithm to learn the network parameters, where the batch size is set to 256 and momentum set to 0.9. We initialize network weights with pre-trained models from ImageNet [33].

      Hyperparameters and weights initialization.
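
The quoted setup is plain momentum SGD. One common formulation of the update, sketched on a scalar weight with momentum 0.9 as in the quote (the learning rate and gradient values are arbitrary placeholders):

```python
# One mini-batch SGD-with-momentum step: the velocity accumulates a decaying
# sum of past gradients, and the weight moves along the velocity.
def sgd_momentum_step(w, g, v, lr=0.01, momentum=0.9):
    v = momentum * v - lr * g
    return w + v, v

w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, g=0.5, v=v)  # first step: pure gradient descent
w, v = sgd_momentum_step(w, g=0.5, v=v)  # second step: velocity kicks in
print(w, v)
```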

    3. Data Augmentation. Data augmentation can generate diverse training samples and prevent severe over-fitting. In the original two-stream ConvNets, random cropping and horizontal flipping are employed to augment training samples. We exploit two new data augmentation techniques: corner cropping and scale-jittering.

      Traditional data augmentation techniques can be used for two-stream architectures.

    4. Network Inputs. We are also interested in exploring more input modalities to enhance the discriminative power of temporal segment networks. Originally, the two-stream ConvNets used RGB images for the spatial stream and stacked optical flow fields for the temporal stream.

      The standard input for a two-stream architecture consists of RGB images (spatial stream) and stacked optical flow fields (temporal stream).

    5. Here a class score Gi is inferred from the scores of the same class on all the snippets, using an aggregation function g. We empirically evaluated several different forms of the aggregation function g, including evenly averaging, maximum, and weighted averaging in our experiments. Among them, evenly averaging is used to report our final recognition accuracies.

      How the result is aggregated from segment level to video level.
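
The consensus function g can be sketched directly. The snippet scores and the `aggregate` helper below are illustrative; evenly averaging is the variant the paper reports final accuracies with:

```python
# Combine per-snippet class scores into one video-level score.
def aggregate(snippet_scores, mode="avg", weights=None):
    k = len(snippet_scores)
    cols = list(zip(*snippet_scores))  # one column of scores per class
    if mode == "avg":
        return [sum(col) / k for col in cols]
    if mode == "max":
        return [max(col) for col in cols]
    if mode == "weighted":
        return [sum(w * s for w, s in zip(weights, col)) for col in cols]
    raise ValueError(mode)

scores = [[0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]  # K = 3 snippets, 2 classes
print(aggregate(scores, "avg"))
print(aggregate(scores, "max"))  # [0.7, 0.8]
```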

    6. In experiments, the number of snippets K is set to 3 according to previous works on temporal modeling [16, 17].

      The paper suggests 3 segments; the implementation in Gluon CV already uses 7. The question we should ask is: how long should the video clip be? This should be the input to the data loader.

    7. Temporal segment network

      Visualization of the 2 stream (Spatial CNN and Temporal CNN) architecture

    8. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling.

      Main contribution

    9. However, mainstream ConvNet frameworks [1, 13] usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts [19, 4, 20] to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length.

      Historical approaches use a predefined sequence length.

    10. In terms of temporal structure modeling, a key observation is that consecutive frames are highly redundant. Therefore, dense temporal sampling, which usually results in highly similar sampled frames, is unnecessary. Instead a sparse temporal sampling strategy will be more favorable in this case. Motivated by this observation, we develop a video-level framework, called temporal segment network (TSN). This framework extracts short snippets over a long video sequence with a sparse sampling scheme, where the samples distribute uniformly along the temporal dimension.

      Confirms the intuition towards sparse sampling
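
The sparse scheme described above amounts to splitting the video into K equal segments and drawing one snippet index from each, so samples spread uniformly in time. A minimal sketch; `sample_snippets` and the frame count are my own:

```python
import random

# TSN-style sparse sampling: one random snippet index per equal-length segment.
def sample_snippets(num_frames, k, rng=random):
    seg_len = num_frames // k
    return [i * seg_len + rng.randrange(seg_len) for i in range(k)]

random.seed(0)
idx = sample_snippets(num_frames=300, k=3)
print(idx)  # one index in [0, 100), one in [100, 200), one in [200, 300)
```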

    11. Limited by computational cost these methods usually process sequences of fixed lengths ranging from 64 to 120 frames.

      Number of frames processed by older approaches