2 Matching Annotations
  1. Nov 2021
    1. Two major sources of memory consumption in large model training: the majority is occupied by model states, including optimizer states (e.g. Adam momentum and variance), gradients, and parameters. Mixed-precision training demands a lot of memory because the optimizer must keep an FP32 copy of the parameters and the other optimizer states in addition to the FP16 versions. The remainder is consumed by activations, temporary buffers, and unusable fragmented memory (called residual states in the paper).

      What are the main sources of GPU memory consumption when training deep networks? (A rough accounting sketch follows below.)
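
      A minimal sketch of that accounting, assuming mixed-precision Adam as discussed in the ZeRO paper (2 bytes each for FP16 parameters and gradients, plus 4 bytes each for the FP32 parameter copy, momentum, and variance); the function name and the example model size are illustrative, not part of any library.

      ```python
      def model_state_bytes(num_params: int) -> dict:
          """Rough model-state memory for mixed-precision Adam training."""
          fp16_params = 2 * num_params                  # FP16 weights used in forward/backward
          fp16_grads = 2 * num_params                   # FP16 gradients
          optimizer_states = (4 + 4 + 4) * num_params   # FP32 params + momentum + variance
          return {
              "fp16_params": fp16_params,
              "fp16_grads": fp16_grads,
              "optimizer_states": optimizer_states,
              "total": fp16_params + fp16_grads + optimizer_states,  # 16 bytes per parameter
          }

      # A 1.5B-parameter model needs roughly 24 GB for model states alone,
      # before counting activations, buffers, and fragmentation (the residual states).
      print(model_state_bytes(1_500_000_000)["total"] / 1e9)  # ~24.0 GB
      ```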

    2. It partitions the optimizer states, gradients, and parameters across multiple data-parallel processes, using a dynamic communication schedule to minimize the communication volume.

      How does ZeRO-DP work? (A per-rank memory sketch follows below.)
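
      A minimal sketch of the per-rank effect of that partitioning, assuming the same 16-bytes-per-parameter accounting and the three cumulative ZeRO-DP stages (optimizer states, then gradients, then parameters); the function name and stage-numbering convention are illustrative.

      ```python
      def zero_dp_bytes_per_rank(num_params: int, dp_degree: int, stage: int) -> float:
          """Approximate per-GPU model-state memory under ZeRO-DP."""
          params, grads, opt = 2.0, 2.0, 12.0   # bytes/param: FP16 params, FP16 grads, FP32 optimizer states
          if stage >= 1:                        # P_os: partition optimizer states
              opt /= dp_degree
          if stage >= 2:                        # P_os+g: also partition gradients
              grads /= dp_degree
          if stage >= 3:                        # P_os+g+p: also partition parameters
              params /= dp_degree
          return (params + grads + opt) * num_params

      # 7.5B parameters across 64 data-parallel GPUs:
      for stage in range(4):
          gb = zero_dp_bytes_per_rank(7_500_000_000, 64, stage) / 1e9
          print(f"stage {stage}: ~{gb:.1f} GB per GPU")  # ~120.0, ~31.4, ~16.6, ~1.9
      ```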