34 Matching Annotations
  1. Jun 2016
  2. Mar 2015
    1. to beat random search in a carefully-chosen intervals.

      to beat random search in a carefully-chosen intervals.

      remove the "a"?

    2. only training for 1 epoch or even less

      only training for 1 epoch or even less ... so do we check only some layers of the network, or all of them, during the first epoch?

    3. That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10.

      That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10.

      The word "random" is repeated; presumably the second occurrence should be "number" (i.e., "a random number with a uniform distribution")?
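
      For context, the sentence is describing sampling a hyperparameter on a log scale (as in the notes' learning_rate = 10 ** uniform(-6, 1)); a minimal sketch of the idea, with my own variable names:

        import numpy as np

        # Sample the exponent uniformly, then take 10 to that power, so every
        # decade (1e-6, 1e-5, ..., 1e0) is equally likely to be explored.
        exponent = np.random.uniform(-6, 1)   # the uniformly random number
        learning_rate = 10 ** exponent        # log-scale sample

        # A linear-scale sample, by contrast, puts almost all of its mass
        # near the top of the range:
        learning_rate_linear = np.random.uniform(1e-6, 1e0)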

    4. Tue to the denominator term in the RMSprop update

      Typo: presumably "Due to the denominator term in the RMSprop update".

    5. the step decay dropout is slightly

      remove the word dropout?

    6. theoretical converge guarantees

      theoretical convergence guarantees

    7. update has recently

      update that has recently

    8. set of parameters

      set of weights per network layer?

    9. model capacity

      Needs some more explaining? A reference to the bias-variance trade-off? A link to VC dimension?

    10. validation/training accuracy

      I have usually encountered the use of error instead of accuracy, normally when discussing the bias-variance trade-off. It seems more intuitive to me. Maybe we can have an equivalent error graph opposite the accuracy graph?

    11. appears more as a slightly more interpretable

      appears as a slightly more interpretable (remove first more)

    12. sizes of million parameters

      Suggest either "can have sizes in the millions of parameters" or "can have millions of parameters".

    13. Therefore, a better solution might be to force a particular random seed before evaluating

      I don't understand. What is the random seed used for? Selecting the dropout nodes whose backprop will be checked?
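
      My reading is that the seed is there so that both loss evaluations in the centered difference use the same dropout mask; a minimal sketch of that idea (the function and variable names are made up):

        import numpy as np

        def numerical_grad_coordinate(loss_fn, x, i, h=1e-5, seed=0):
            # Evaluate f(x+h) and f(x-h) with the SAME seed so any internal
            # randomness (e.g. which dropout units are zeroed) is identical
            # in both evaluations; otherwise the difference also picks up noise.
            x[i] += h
            np.random.seed(seed)
            fxph = loss_fn(x)
            x[i] -= 2 * h
            np.random.seed(seed)
            fxmh = loss_fn(x)
            x[i] += h                        # restore the original value
            return (fxph - fxmh) / (2 * h)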

    14. If your gradcheck for only ~2 or 3 datapoints then you will almost certainly gradcheck for an entire batch.

      Just to confirm: if I am using a batch of 10 data points to update the gradient, I only need to gradcheck 2 or 3 of those data points? And is this true irrespective of the size of the batch?

    15. combine the parameters into a single large parameter vector

      The documentation talks of weights and parameters. I assume that in this case the parameters are the weights. Maybe reinforce this by adding the word weights in parentheses? That would help us differentiate between the weight matrix and the hyper-parameters.
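
      As I read it, the parameters here are the learnable weights and biases (not the hyper-parameters); a hypothetical sketch of the flattening for a two-layer net, with made-up shapes:

        import numpy as np

        # Learnable parameters (weights and biases), not hyper-parameters.
        W1, b1 = np.random.randn(4, 10), np.zeros(4)
        W2, b2 = np.random.randn(3, 4), np.zeros(3)

        # One large parameter vector, convenient for gradient checking ...
        theta = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

        # ... which can be unflattened again inside the loss function.
        W1_back = theta[:40].reshape(4, 10)
        b1_back = theta[40:44]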

    16. hack the code to remove the data loss contribution.

      Maybe it should be: hack the code to remove the regularization loss contribution.

    1. U1 = np.random.rand(*H1.shape) < p

      How does this work? Does it set all elements of some randomly selected rows of the weight matrix to 0?
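
      As I read it, the line builds a boolean mask with the same shape as the hidden activations H1: each element is kept independently with probability p, so individual activations are zeroed rather than whole rows of the weight matrix. A small sketch:

        import numpy as np

        p = 0.5                               # probability of keeping a unit
        H1 = np.random.randn(3, 4)            # stand-in for hidden activations

        # np.random.rand draws uniform [0, 1) values; comparing against p gives
        # a boolean mask where each entry is independently True with prob. p.
        U1 = np.random.rand(*H1.shape) < p
        H1_dropped = H1 * U1                  # dropped activations become 0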

    2. This is motivated by based on a compromise and an equivalent analysis

      Typo/grammar: The motivation for this is based on a compromise and an equivalent analysis

    3. This turns out to be a mistake,

      I think SGD will also fail for zero values, because the gradients get multiplied by these zeros and the weights will therefore never change. Is this thinking correct?
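
      For what it's worth, here is a quick numerical check of that intuition on a tiny two-layer tanh net (made-up sizes): with all-zero weights the weight gradients come out exactly zero, so the weights never move; the notes' broader point is that any constant initialization leaves every hidden unit with identical gradients.

        import numpy as np

        x = np.random.randn(5, 1)                 # one input example
        y = 2                                     # correct class index
        W1, b1 = np.zeros((4, 5)), np.zeros((4, 1))
        W2, b2 = np.zeros((3, 4)), np.zeros((3, 1))

        h = np.tanh(W1 @ x + b1)                  # hidden activations: all zero
        scores = W2 @ h + b2                      # scores: all zero
        probs = np.exp(scores) / np.exp(scores).sum()
        dscores = probs.copy(); dscores[y] -= 1   # softmax gradient (nonzero)

        dW2 = dscores @ h.T                       # zero, because h is zero
        dh = W2.T @ dscores                       # zero, because W2 is zero
        dW1 = (dh * (1 - h**2)) @ x.T             # zero as well
        print(np.abs(dW2).max(), np.abs(dW1).max())   # -> 0.0 0.0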

    4. with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative

      Can anyone explain why?

    1. to to

      Typo

    2. xi scaled

      \(x_i\) is scaled

    3. Notice that this is the gradient only with respect to the row of W that corresponds to the correct class. For the other rows where \(j \neq y_i\) the gradient is:

      I am at a loss here (no pun intended). So for a given class \(y_i\), do I only calculate the gradient for those \(L_i\) that are labelled with \(y_i\)? I assume I have to do this for all \(y_i\). So what do I use the expression below for?

      TIA.
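
      For what it's worth, the two expressions together give the full gradient of \(L_i\) for a single example \(x_i\): the first fills the row \(w_{y_i}\), the second fills every other row \(j \neq y_i\), and this is repeated (and summed) over all examples. A small sketch under those assumptions (variable names are mine, \(\Delta = 1\)):

        import numpy as np

        def svm_grad_single_example(W, x, y, delta=1.0):
            # W: (C, D) weights, x: (D,) one example, y: its correct class.
            scores = W @ x
            margins = scores - scores[y] + delta
            margins[y] = 0
            indicator = (margins > 0).astype(float)    # the 1(...) terms

            dW = np.zeros_like(W)
            dW += np.outer(indicator, x)               # rows j != y_i
            dW[y] = -indicator.sum() * x               # row y_i
            return dW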

    1. The final loss for this example is 1.58 for the SVM and 0.452 for the Softmax classifier

      The figure above has a value of 1.04 for the softmax case. I think that should be \(0.452\).
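
      For anyone re-deriving the numbers, here is the arithmetic, assuming the figure's scores are [-2.85, 0.86, 0.28] with the third entry as the correct class (my reading of the figure):

        import numpy as np

        scores = np.array([-2.85, 0.86, 0.28])    # assumed scores from the figure
        y = 2                                     # assumed correct class

        # Softmax cross-entropy loss: -log of the normalized probability.
        probs = np.exp(scores) / np.exp(scores).sum()
        softmax_loss = -np.log(probs[y])          # ~ 1.04 for these scores

        # Multiclass SVM (hinge) loss with delta = 1.
        margins = np.maximum(0, scores - scores[y] + 1.0)
        margins[y] = 0
        svm_loss = margins.sum()                  # ~ 1.58 for these scores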

    1. The synapses are not just a single weight a complex non-linear dynamical system

      Typo. Grammar.

      The synapses are not just a single weight, but a complex non-linear dynamical system.

    2. noone

      The usual spelling is "no one" and to a lesser extent "no-one". Just nit-picking. B-)

    3. dropou

      typo: should be "dropout"