11 Matching Annotations
  1. Jan 2021
    1. For the extraction of optical flow and warped optical flow, we choose the TVL1 optical flow algorithm [35] implemented in OpenCV with CUDA.

      The optical flow algorithm used; its output is the input to the temporal CNN. A minimal extraction sketch follows.
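
      A minimal sketch of TVL1 flow extraction, assuming the opencv-contrib-python build (which provides cv2.optflow); the CUDA variant lives under cv2.cuda in CUDA-enabled builds, and the clipping range of ±20 is a common convention in two-stream implementations, not something the paper specifies.

      ```python
      import cv2
      import numpy as np

      def extract_tvl1_flow(prev_gray, next_gray):
          """Dense TVL1 optical flow between two grayscale uint8 frames."""
          tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
          flow = tvl1.calc(prev_gray, next_gray, None)  # shape (H, W, 2)
          # Clip and rescale to [0, 255] so the flow can be stored as images,
          # a common preprocessing step for temporal-stream inputs.
          flow = np.clip(flow, -20, 20)
          return ((flow + 20) / 40 * 255).astype(np.uint8)
      ```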

    2. We use the mini-batch stochastic gradient descent algorithm to learn the network parameters, where the batch size is set to 256 and momentum set to 0.9. We initialize network weights with pre-trained models from ImageNet [33].

      Hyperparameters and weight initialization.
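
      A sketch of the described optimization setup in PyTorch (the framework choice, backbone, and learning rate are illustrative assumptions; only batch size 256, momentum 0.9, and ImageNet initialization come from the paper):

      ```python
      import torch
      import torchvision
      from torchvision import transforms

      # Initialize from ImageNet-pretrained weights.
      model = torchvision.models.resnet50(weights="IMAGENET1K_V1")

      # Mini-batch SGD with momentum 0.9; lr=0.001 is illustrative.
      optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

      # Placeholder dataset so the loader runs end to end.
      dataset = torchvision.datasets.FakeData(size=512,
                                              transform=transforms.ToTensor())
      loader = torch.utils.data.DataLoader(dataset, batch_size=256,
                                           shuffle=True)
      ```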

    3. Data Augmentation. Data augmentation can generate diverse training samples and prevent severe over-fitting. In the original two-stream ConvNets, random cropping and horizontal flipping are employed to augment training samples. We exploit two new data augmentation techniques: corner cropping and scale jittering.

      Traditional data augmentation techniques carry over to two-stream architectures; TSN adds corner cropping and scale jittering, sketched below.
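
      A sketch of the two new techniques, assuming frames are first resized to 256x340 and crop sizes are drawn from {256, 224, 192, 168}, as in common TSN implementations:

      ```python
      import random
      from PIL import Image

      SCALES = [256, 224, 192, 168]   # scale jittering: candidate crop sizes

      def corner_crop_scale_jitter(img: Image.Image) -> Image.Image:
          """Random corner/center crop at a jittered scale, then resize."""
          w, h = img.size               # assumed >= max(SCALES), e.g. 340x256
          crop = random.choice(SCALES)
          # Corner cropping: restrict crop positions to the four corners
          # and the center instead of fully random locations.
          positions = [(0, 0), (w - crop, 0), (0, h - crop),
                       (w - crop, h - crop),
                       ((w - crop) // 2, (h - crop) // 2)]
          x, y = random.choice(positions)
          return img.crop((x, y, x + crop, y + crop)).resize((224, 224))
      ```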

    4. Network Inputs. We are also interested in exploring more input modalities to enhance the discriminative power of temporal segment networks. Originally, the two-stream ConvNets used RGB images for the spatial stream and stacked optical flow fields for the temporal stream.

      The standard input format for a two-stream architecture consists of an RGB image (spatial stream) and stacked optical flow fields (temporal stream), as sketched below.
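
      A sketch of how the temporal-stream input is assembled: L consecutive flow fields, each with horizontal and vertical components, stacked into a 2L-channel array (L=5 here is illustrative):

      ```python
      import numpy as np

      def stack_flow(flow_fields):
          """flow_fields: list of L arrays of shape (H, W, 2) -> (2L, H, W)."""
          channels = []
          for flow in flow_fields:
              channels.append(flow[..., 0])  # horizontal component
              channels.append(flow[..., 1])  # vertical component
          return np.stack(channels, axis=0)

      # 5 flow fields of size 224x224 -> temporal-stream input (10, 224, 224).
      fields = [np.random.randn(224, 224, 2).astype(np.float32)
                for _ in range(5)]
      assert stack_flow(fields).shape == (10, 224, 224)
      ```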

    5. Here a class score Gi is inferred from the scores of the same class on all the snippets, using an aggregation function g. We empirically evaluated several different forms of the aggregation function g, including evenly averaging, maximum, and weighted averaging in our experiments. Among them, evenly averaging is used to report our final recognition accuracies.

      How results are aggregated from snippet level to video level; see the sketch below.
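
      A sketch of the segmental consensus over snippet scores; the function names are mine, but the three aggregation forms (even averaging, maximum, weighted averaging) are the ones the paper evaluates:

      ```python
      import torch

      def aggregate(snippet_scores, g="avg", weights=None):
          """snippet_scores: (K, num_classes) -> video-level (num_classes,)."""
          if g == "avg":       # evenly averaging: the paper's final choice
              return snippet_scores.mean(dim=0)
          if g == "max":       # maximum over snippets, per class
              return snippet_scores.max(dim=0).values
          if g == "weighted":  # weighted averaging, weights of shape (K,)
              return (weights.unsqueeze(1) * snippet_scores).sum(dim=0)
          raise ValueError(f"unknown aggregation: {g}")

      scores = torch.randn(3, 101)      # K=3 snippets, 101 classes (UCF101)
      video_score = aggregate(scores)   # shape (101,)
      ```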

    6. In experiments, the number of snippets K is set to 3 according to previous works on temporal modeling [16,17].

      The paper suggests 3 segments; the GluonCV implementation already defaults to 7. The question we should ask is: how long should the video clip be? That clip length is the input to the data loader; see the sampling sketch below.
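
      A sketch of the sparse sampling behind this choice: split the video into K equal segments and draw one snippet index per segment, random at training time and centered at test time (the train/test behavior follows common TSN implementations):

      ```python
      import numpy as np

      def sample_snippet_indices(num_frames, k=3, train=True):
          """One frame index per segment; assumes num_frames >= k."""
          seg_len = num_frames // k
          starts = np.arange(k) * seg_len
          if train:
              offsets = np.random.randint(0, seg_len, size=k)  # random snippet
          else:
              offsets = np.full(k, seg_len // 2)               # segment center
          return starts + offsets

      # A 300-frame clip with K=3 yields one index in each 100-frame segment.
      print(sample_snippet_indices(300, k=3))
      ```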

    7. Temporal segment network

      Visualization of the two-stream (spatial CNN and temporal CNN) architecture.

    8. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling.

      Main contribution

    9. However, mainstream ConvNet frameworks [1,13] usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts [19,4,20] to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length.

      Historical approaches used dense temporal sampling with a predefined sampling interval.

    10. In terms of temporal structure modeling, a key observation is that consecutive frames are highly redundant. Therefore, dense temporal sampling, which usually results in highly similar sampled frames, is unnecessary. Instead a sparse temporal sampling strategy will be more favorable in this case. Motivated by this observation, we develop a video-level framework, called temporal segment network (TSN). This framework extracts short snippets over a long video sequence with a sparse sampling scheme, where the samples distribute uniformly along the temporal dimension.

      Confirms the intuition behind sparse sampling.

    11. Limited by computational cost, these methods usually process sequences of fixed lengths ranging from 64 to 120 frames

      Number of frames processed by older approaches