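The collect operator has two performance pitfalls: the network overhead of pulling the data to the Driver, and the risk of an OOM (Out of Memory) on the Driver.
Collecting data consumes memory on the Driver.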
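A rough back-of-envelope sketch of both pitfalls, in plain Python rather than Spark code. The function name, parameters, and example numbers are hypothetical and chosen only for illustration. The estimate ignores serialization and JVM object overhead, so the real Driver footprint is likely to be larger.

```python
GIB = 1024 ** 3

def estimate_collect_cost(num_rows, avg_row_bytes, network_bytes_per_sec, driver_budget_bytes):
    """Estimate what pulling a whole dataset to the Driver would cost.

    All inputs are assumptions supplied by the caller; nothing here is
    measured from a real cluster.
    """
    total_bytes = num_rows * avg_row_bytes
    return {
        # Raw payload that has to travel from the executors to the Driver.
        "total_bytes": total_bytes,
        # Pitfall 1: time spent on the network at the assumed bandwidth.
        "transfer_seconds": total_bytes / network_bytes_per_sec,
        # Pitfall 2: whether the payload even fits in the Driver's memory budget.
        "fits_on_driver": total_bytes <= driver_budget_bytes,
    }

# Illustrative numbers: 500M rows of ~200 bytes, a ~1 Gbit/s link, a 4 GiB budget.
cost = estimate_collect_cost(
    num_rows=500_000_000,
    avg_row_bytes=200,
    network_bytes_per_sec=125_000_000,
    driver_budget_bytes=4 * GIB,
)
print(cost)
```

When the estimate does not fit, the usual options are to aggregate or filter before collecting, to fetch only a bounded sample, or to write the result to storage instead of bringing it to the Driver.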
Suppose that GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute.
for - stat - AI evolution - prediction 2027 - training time - 6 OOM decrease
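stat - AI evolution - prediction 2027 - training time - 6 OOM decrease - today it takes 3 months to train GPT-4 - in 2027, it will take 1 minute - that is, 131,400 minutes vs 1 minute, a factor of about 1.3 × 10^5, or just over 5 OOM (log10 131,400 ≈ 5.1), so the "6 OOM" label overstates the ratio by nearly one order of magnitude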
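The arithmetic behind this stat can be checked with a short script. It is a minimal sketch that assumes "3 months" means a quarter of a 365-day year, which reproduces the 131,400-minute figure. The variable names are illustrative. Under that assumption the ratio works out to a little over 5 orders of magnitude.

```python
import math

# Assumption: "3 months" = 365 / 4 = 91.25 days, matching the note's 131,400 minutes.
MINUTES_PER_DAY = 24 * 60
baseline_minutes = 365 / 4 * MINUTES_PER_DAY
target_minutes = 1

# Speedup factor and its base-10 logarithm (orders of magnitude, "OOM").
speedup = baseline_minutes / target_minutes
ooms = math.log10(speedup)

print(f"baseline: {baseline_minutes:,.0f} minutes")
print(f"speedup:  {speedup:,.0f}x")
print(f"orders of magnitude: {ooms:.2f}")
```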
walking bass, ching-a-ding, oom-pah
Linux Memory Management at Scale
"we had to build a complete and compliant operating system in order to perform resource control reliably"
epic real-talk. the only people on the planet who seemed to have tamed linux for workloads. controlling memory. taming io. being on the bleeding edge, it turns out, is almost entirely about forward-progress. what can we reclaim?
https://facebookmicrosites.github.io/cgroup2/docs/fbtax-results.html