Inter-node communication stalls: high batching is crucial to profitably serve millions of users, and for SOTA reasoning models, many nodes are often required. Inference workloads then start to resemble training workloads.
Oh, so to get the highest throughput, the inference servers also batch operations, making it look a bit like training too.
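A minimal sketch of the dynamic-batching idea behind that throughput gain: the server collects incoming requests up to a batch size or a latency deadline, then runs one large forward pass. The helper below is hypothetical (not any particular serving framework's API), just to illustrate the batch-size vs. wait-time trade-off.

```python
import queue
import time

def collect_batch(requests: queue.Queue, max_batch: int = 8,
                  max_wait_s: float = 0.01) -> list:
    """Gather up to max_batch requests, waiting at most max_wait_s
    after the first one arrives. Bigger batches amortize model and
    communication cost; the deadline caps added latency."""
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # deadline hit with a partial batch
    return batch

if __name__ == "__main__":
    q = queue.Queue()
    for i in range(10):
        q.put(f"req-{i}")
    # Only 8 requests are taken even though 10 are queued.
    print(len(collect_batch(q, max_batch=8)))
```

The `max_wait_s` knob is what distinguishes serving from training: training can always wait for a full batch, while a latency-sensitive server must sometimes run a partial one.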