epsilon. A very small number used to prevent division by zero in the implementation (e.g. 1e-8).

Further, learning rate decay can also be used with Adam. The paper uses a decay rate alpha = alpha/sqrt(t), updated each epoch (t), for the logistic regression demonstration.

The Adam paper suggests:

Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8

The TensorFlow documentation suggests some tuning of epsilon:

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

We can see that the popular deep learning libraries generally use the default parameters recommended by the paper.

- TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08.
- Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0.
- Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1.
- Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
- Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
- MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
- Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
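To make the role of these parameters concrete, here is a minimal sketch of a single Adam update step in plain NumPy using the paper's suggested defaults. The function and variable names (adam_step, m, v) are my own, and the commented alpha/sqrt(t) line is just one way to apply the per-epoch decay mentioned above, not the paper's exact code.

```python
import numpy as np

def adam_step(params, grads, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the paper's default hyperparameters.

    params, grads, m and v are arrays of the same shape; t is the
    1-based step count. Returns the updated (params, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grads        # biased first moment estimate
    v = beta2 * v + (1.0 - beta2) * grads ** 2   # biased second moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias-corrected second moment
    # eps keeps the denominator away from zero
    params = params - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Optional per-epoch decay, as in the paper's logistic regression demo:
# alpha_t = alpha / np.sqrt(epoch)  # epoch is 1-based
```

In practice adam_step would be called once per mini-batch, carrying m, v and t forward between calls.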
Should we expose EPS as one of the experiment parameters? I think we shouldn't, since it is a rather technical parameter.
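For illustration, a minimal sketch of what keeping epsilon fixed while exposing the tunable hyperparameters might look like; the ExperimentParams name and its fields are hypothetical, not taken from any existing codebase.

```python
from dataclasses import dataclass

EPS = 1e-8  # fixed implementation detail, not exposed to experiments

@dataclass
class ExperimentParams:
    # Only the hyperparameters we actually expect to tune are exposed.
    alpha: float = 0.001
    beta1: float = 0.9
    beta2: float = 0.999
```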


