34 Matching Annotations
  1. Jun 2016
  2. Mar 2015
    1. only training for 1 epoch or even less

      only training for 1 epoch or even less … so do we check only some layers of the network, or all of them, during the first epoch?

    2. That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10.

      That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10.

      The word "random" is repeated; presumably it should read "random number".
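
      For context, the pattern being described appears to be log-scale sampling of a hyperparameter: draw a uniformly distributed random number and use it as an exponent of 10. A minimal sketch (the range -6..1 is a hypothetical example, not taken from the notes):

      ```python
      import numpy as np

      np.random.seed(0)
      # Sample exponents uniformly, then exponentiate: the resulting values
      # are spread evenly across orders of magnitude between 1e-6 and 1e1,
      # rather than clustered near the top of the range.
      learning_rates = 10 ** np.random.uniform(-6, 1, size=5)
      ```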

    3. validation/training accuracy

      I have usually encountered error rather than accuracy, normally when discussing the bias-variance trade-off; that seems more intuitive to me. Maybe we could add an equivalent error graph opposite the accuracy graph?

    4. Therefore, a better solution might be to force a particular random seed before evaluating

      I don't understand: what is the random seed used for? Selecting the dropout nodes whose backprop will be checked?
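
      For illustration, a sketch of why the seed matters (the loss `f` below is made up, not the notes' actual code): both evaluations in the centered difference must see the same dropout mask; otherwise the difference \(f(x+h) - f(x-h)\) is dominated by the change of mask rather than the change of \(x\).

      ```python
      import numpy as np

      def f(x):
          # Hypothetical loss with dropout-style randomness: the mask changes
          # on every call unless the global seed is fixed beforehand.
          mask = (np.random.rand(*x.shape) < 0.5) / 0.5
          return np.sum((x * mask) ** 2)

      def numeric_grad(f, x, h=1e-5, seed=0):
          # Centered-difference gradient; the seed is re-fixed before each
          # evaluation so f(x+h) and f(x-h) see the same dropout mask.
          grad = np.zeros_like(x)
          for i in range(x.size):
              old = x.flat[i]
              x.flat[i] = old + h
              np.random.seed(seed)
              fxph = f(x)
              x.flat[i] = old - h
              np.random.seed(seed)
              fxmh = f(x)
              x.flat[i] = old
              grad.flat[i] = (fxph - fxmh) / (2 * h)
          return grad
      ```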

    5. If your gradcheck for only ~2 or 3 datapoints then you will almost certainly gradcheck for an entire batch.

      Just to confirm: if I am using a batch of 10 data points to compute the gradient update, I only need to check 2 or 3 of those data points? And is this true irrespective of the batch size?

    6. combine the parameters into a single large parameter vector

      The documentation talks of weights and parameters; I assume that here the parameters are the weights. Maybe reinforce this by adding the word weights in parentheses? That helps us differentiate the weight matrix from the hyperparameters.

    1. U1 = np.random.rand(*H1.shape) < p

      How does this work? Does it set all elements of some randomly selected rows of the weight matrix to 0?
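
      For reference, a minimal sketch of what this line computes (the shapes are hypothetical): the comparison yields an element-wise boolean mask with the same shape as the hidden activations `H1`, each element independently `True` with probability `p`. It zeroes individual activations, not rows of the weight matrix.

      ```python
      import numpy as np

      np.random.seed(1)
      p = 0.5                              # probability of keeping a unit
      H1 = np.random.randn(4, 5)           # hypothetical hidden-layer activations
      U1 = np.random.rand(*H1.shape) < p   # boolean mask, same shape as H1
      H1_dropped = H1 * U1                 # individual activations zeroed out
      # On average a fraction p of the entries survive; the weights are untouched.
      ```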

    2. This is motivated by based on a compromise and an equivalent analysis

      Typo/grammar: The motivation for this is based on a compromise and an equivalent analysis

    3. This turns out to be a mistake,

      I think SGD will also fail for 0 values, because the gradients will be multiplied by these zeros and the weights will therefore never change. Is this thinking correct?
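
      A small sketch of the symmetry argument usually given here (a hypothetical 2-layer net; the weights are nonzero but identical, which is enough to show the problem): units initialized identically receive identical gradients, so no update can ever make them different.

      ```python
      import numpy as np

      np.random.seed(0)
      x = np.random.randn(1, 3)       # one hypothetical input
      W1 = np.full((3, 2), 0.5)       # two hidden units initialized identically
      W2 = np.full((2, 1), 0.5)

      h = np.maximum(0, x @ W1)       # ReLU hidden layer; both units equal
      y = h @ W2                      # scalar output
      dy = np.array([[1.0]])          # upstream gradient

      dW2 = h.T @ dy                  # gradient for the second layer
      dh = dy @ W2.T
      dh[h <= 0] = 0                  # backprop through ReLU
      dW1 = x.T @ dh                  # gradient for the first layer
      # Both columns of dW1 (and both rows of dW2) are identical, so the two
      # hidden units receive the same update forever: symmetry is never broken.
      ```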

    4. with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative

      Can anyone explain why?

    1. Notice that this is the gradient only with respect to the row of W that corresponds to the correct class. For the other rows where j≠yi the gradient is:

      I am at a loss here (no pun intended). So for a given class \(y_i\) I only calculate the gradient for all those \(L_i\) that are labelled with \(y_i\). I assume I have to do this for all \(y_i\). So what do I use the expression below for?

      TIA.
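
      A sketch may help connect the two expressions (the function name is hypothetical, assuming the multiclass SVM loss \(L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)\)): for a single example, every row \(j \neq y_i\) whose margin is positive gets gradient \(+x_i\), and the correct-class row \(y_i\) gets \(-x_i\) times the number of such rows. So both expressions are used for every example, just for different rows of \(W\).

      ```python
      import numpy as np

      def svm_grad_single(W, x, y, delta=1.0):
          # Gradient of the multiclass SVM loss for one example (x, y).
          scores = W @ x                   # one score per class
          margins = scores - scores[y] + delta
          margins[y] = 0                   # the correct class has no margin term
          positive = margins > 0           # classes violating the margin
          dW = np.zeros_like(W)
          dW[positive] = x                 # rows j != y: gradient is +x
          dW[y] = -positive.sum() * x      # row y: -(number of violations) * x
          return dW
      ```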

    1. The final loss for this example is 1.58 for the SVM and 0.452 for the Softmax classifier

      The figure above has a value of 1.04 for the softmax case. I think that should be \(0.452\).
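
      For anyone wanting to verify which number is right, the softmax loss can be recomputed directly from the scores in the figure. A sketch of the computation (the scores below are hypothetical, not taken from the figure):

      ```python
      import numpy as np

      def softmax_loss(scores, y):
          # Numerically stable softmax cross-entropy loss for one example:
          # shift by the max score before exponentiating to avoid overflow.
          shifted = scores - np.max(scores)
          probs = np.exp(shifted) / np.sum(np.exp(shifted))
          return -np.log(probs[y])

      # Hypothetical scores and correct class, just to show the computation:
      loss = softmax_loss(np.array([1.0, -2.0, 0.5]), y=0)
      ```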

    1. The synapses are not just a single weight a complex non-linear dynamical system

      Typo/grammar:

      The synapses are not just a single weight, but a complex non-linear dynamical system