The hyperparameter search space is summarized in Table 1, with full results in Table 2. While no single configuration is universally optimal, we highlight a setting with block_size=4, fetch_factor=16, and num_workers=12, which achieves approximately 2593 samples/sec while maintaining a minibatch entropy of 3.59, comparable to fully random sampling.
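To make the tradeoff concrete, here is a minimal sketch of the two quantities involved: a block-granularity shuffle (contiguous blocks are kept intact for sequential I/O, but their order is randomized) and an average per-batch label entropy as a rough proxy for how well-mixed minibatches are. This is an illustrative sketch under assumed definitions, not the tool's actual implementation; the function names and the entropy proxy are hypothetical.

```python
import math
import random
from collections import Counter

def block_shuffle(n, block_size, seed=0):
    """Shuffle sample indices at block granularity: contiguous blocks of
    `block_size` indices stay intact (preserving sequential-read locality),
    but the order of the blocks is randomized. Hypothetical sketch."""
    rng = random.Random(seed)
    blocks = [list(range(i, min(i + block_size, n)))
              for i in range(0, n, block_size)]
    rng.shuffle(blocks)
    return [i for block in blocks for i in block]

def mean_batch_entropy(order, labels, batch_size):
    """Average Shannon entropy (nats) of the label distribution within each
    minibatch, one rough proxy for shuffle quality: fully random sampling
    maximizes it, while large intact blocks of same-label data lower it."""
    entropies = []
    for start in range(0, len(order), batch_size):
        batch_labels = [labels[i] for i in order[start:start + batch_size]]
        total = len(batch_labels)
        counts = Counter(batch_labels)
        ent = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies.append(ent)
    return sum(entropies) / len(entropies)
```

A smaller block_size raises per-batch entropy toward the random-shuffle ceiling but costs more random reads; a larger block_size does the opposite, which is exactly the throughput-versus-mixing tradeoff the table explores.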
This is a powerful tool that lets everyone train on large datasets, thank you for sharing it with the community! Do you have any practical sense of the tradeoff between minibatch entropy and model validation performance for a fixed amount of training time? I realize this would be a very difficult experiment to run exhaustively, but I wonder whether an even lower minibatch entropy, which allows higher throughput, might actually be preferable under a fixed time budget. Do you have any anecdotal evidence from training runs about how much shuffling is optimal? Absent such an experiment, I agree that staying close to random shuffling is probably the safest default. Thank you for this contribution!