52 Matching Annotations
  1. Jun 2022
    1. [quoted passage garbled in extraction; text not recoverable]

      the crux

  2. May 2022
    1. Will they make exactly the same action selections and weight updates?

      no: in Q-learning the greedy action in the Bellman target is evaluated before the update, but the next step's action is then selected from the updated Q.

      whereas in SARSA with a greedy policy, the same greedy action is used in the update target and is also the action taken to generate the next state. So the two can diverge whenever an update changes which action is greedy (a small sketch of both updates follows after these two notes).

    2. Q-learning considered an off-policy control method?

      because the policy whose value Q estimates is the one that is greedy w.r.t. Q, while the policy generating the samples can be anything
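
      A minimal tabular sketch of the two update rules under greedy action selection (the tiny MDP, step size, and Q initialization below are invented for illustration): Q-learning bootstraps from max_a Q(S', a) and only afterwards picks the next action from the updated table, while greedy SARSA commits to A' before updating, so the two can select different actions at S' whenever the update flips which action is greedy.

      ```python
      import numpy as np

      alpha, gamma = 0.5, 0.9  # invented step size and discount

      def q_learning_step(Q, s, a, r, s_next):
          # Bootstrap from the greedy value of the *pre-update* Q ...
          Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
          # ... then select the next action greedily from the *updated* Q.
          return int(Q[s_next].argmax())

      def greedy_sarsa_step(Q, s, a, r, s_next):
          # Commit to the greedy next action *before* updating ...
          a_next = int(Q[s_next].argmax())
          # ... use that same action in the target, and also take it next.
          Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
          return a_next

      # Same transition fed to both: here the update makes action 0 greedy at
      # s_next, so Q-learning takes action 0 next while SARSA already committed to 1.
      Q1 = np.array([[0.0, 0.1], [0.0, 0.1]])
      Q2 = Q1.copy()
      print(q_learning_step(Q1, s=0, a=0, r=1.0, s_next=0),
            greedy_sarsa_step(Q2, s=0, a=0, r=1.0, s_next=0))   # -> 0 1
      ```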

  3. Nov 2021
  4. Jul 2021
  5. May 2021
    1. $= (I - \alpha\phi_t\phi_t^\top)\,\theta_t + \alpha\sum_{i=0}^{t-1} A^t_{i+1}\,\phi_i\,(\gamma\lambda)^{t-i}\,\delta'_t + \alpha\phi_t\left(R_{t+1} + \gamma\,\theta_t^\top\phi_{t+1}\right)$

      this is already computationally quite nice, but these jokers want to incorporate the last term into the modified TD error
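
      For reference, a sketch of where that derivation lands: the true online TD(λ) update for linear value estimation. This is written from memory of van Seijen et al. / Sutton & Barto rather than quoted from the annotated text, so treat the exact form of the dutch trace and the V_old correction terms as my paraphrase; the features, rewards, and hyperparameters are invented.

      ```python
      import numpy as np

      n_features = 4
      alpha, gamma, lam = 0.1, 0.99, 0.9   # invented hyperparameters

      w = np.zeros(n_features)   # weight vector
      z = np.zeros(n_features)   # dutch-style eligibility trace
      v_old = 0.0

      # Invented sequence of (feature vector, reward, next feature vector) steps.
      steps = [(np.array([1.0, 0, 0, 0]), 0.5, np.array([0, 1.0, 0, 0])),
               (np.array([0, 1.0, 0, 0]), 0.0, np.array([0, 0, 1.0, 0]))]

      for x, r, x_next in steps:
          v, v_next = w @ x, w @ x_next
          delta = r + gamma * v_next - v
          # Dutch trace: an accumulating trace with a correction on the current feature.
          z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
          # The extra (v - v_old) terms are what make the online updates exactly
          # equivalent to the online lambda-return algorithm (hence "true online").
          w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
          v_old = v_next

      print(np.round(w, 4))
      ```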

    1. changing the values of those past states for when they occur again in the future

      assuming the special linear case, with a discrete state space where the feature vector is a one-hot encoding, the TD error multiplied by the eligibility vector gives the effect of the current TD error on each state, with the effect amplified (or attenuated) by each state's recency of occurrence.

    2. assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time.

      the current TD error contributes less and less to those states that occurred further back in time (using the linear case helps here: the eligibility trace is then a sum of past, fading state input vectors).
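
      A small sketch of that backward-view credit assignment in the one-hot linear (i.e. tabular) case; λ, γ, the step size, and the visit sequence are invented. The accumulating trace z is a fading sum of past one-hot feature vectors, so alpha * delta * z spreads the current TD error over recently visited states with geometrically decaying weight.

      ```python
      import numpy as np

      n_states = 5
      gamma, lam, alpha = 0.9, 0.8, 0.1    # invented parameters

      w = np.zeros(n_states)   # weights == state values for one-hot features
      z = np.zeros(n_states)   # accumulating eligibility trace

      def one_hot(s):
          x = np.zeros(n_states)
          x[s] = 1.0
          return x

      # Invented trajectory of (state, reward, next_state) transitions;
      # only the last transition produces a nonzero TD error.
      trajectory = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]

      for s, r, s_next in trajectory:
          x, x_next = one_hot(s), one_hot(s_next)
          delta = r + gamma * w @ x_next - w @ x   # TD error at the current step
          z = gamma * lam * z + x                  # fading sum of past feature vectors
          w += alpha * delta * z                   # credit delta by trace recency

      # States visited further back received geometrically smaller updates.
      print(np.round(w, 4))   # roughly [0.0518, 0.072, 0.1, 0, 0]
      ```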

    3. How about the change in left-side outcome from 0 to −1 made in the larger walk? Do you think that made any difference in the best value of n?

      yes, because rewards can propagate in from both sides now, making the optimal n shorter?
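
      A tiny sketch of the n-step target behind that intuition (the reward sequences below are invented, with γ = 1 as in the random-walk examples): a backup with larger n "sees" a terminal outcome up to n steps ahead, so with informative outcomes at both ends of the walk, useful corrections reach a state from whichever terminal is closer.

      ```python
      # n-step TD target: G_{t:t+n} = R_{t+1} + g R_{t+2} + ... + g^{n-1} R_{t+n} + g^n V(S_{t+n})
      def n_step_target(rewards, bootstrap_value, gamma=1.0):
          """n-step target with n = len(rewards) (invented helper, not from the text)."""
          g = 0.0
          for r in reversed(rewards):
              g = r + gamma * g
          return g + gamma ** len(rewards) * bootstrap_value

      # Left terminal with outcome -1 lies three steps ahead: a 3-step target already
      # reflects it, while a 1-step target still bootstraps from an uninformed estimate.
      print(n_step_target([0.0, 0.0, -1.0], bootstrap_value=0.0))  # -> -1.0
      print(n_step_target([0.0], bootstrap_value=0.0))             # -> 0.0
      ```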

  6. Jan 2021
  7. Dec 2020
    1. $\alpha\, G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$

      notice that the multiplier of the gradient here, G_t / π(A_t|S_t), is positive (whenever the return G_t is positive), meaning we always move in the same direction as the gradient. Using a baseline, G_t − v(S_t), lets us reverse this direction when G_t is lower than the baseline.
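
      A minimal sketch of that sign flip for a softmax policy over three actions (the sampled action, return, baseline value, and parameters are invented): with a positive G_t and no baseline the update always increases the probability of the sampled action, while a baseline larger than G_t reverses the step and decreases it.

      ```python
      import numpy as np

      alpha = 0.1
      theta = np.zeros(3)      # preferences of a 3-action softmax policy

      def softmax(prefs):
          e = np.exp(prefs - prefs.max())
          return e / e.sum()

      def grad_log_pi(prefs, a):
          # For a softmax policy, grad of log pi(a) w.r.t. the preferences is one_hot(a) - pi,
          # and grad log pi = grad pi / pi, i.e. the ratio in the quoted update.
          g = -softmax(prefs)
          g[a] += 1.0
          return g

      a_sampled, G = 0, 2.0    # invented sampled action and return
      baseline = 5.0           # invented baseline v(S_t), deliberately larger than G

      step_plain = alpha * G * grad_log_pi(theta, a_sampled)
      step_baselined = alpha * (G - baseline) * grad_log_pi(theta, a_sampled)

      # Same gradient direction, opposite sign once the baseline exceeds the return.
      print(softmax(theta + step_plain)[a_sampled] > softmax(theta)[a_sampled])      # True
      print(softmax(theta + step_baselined)[a_sampled] < softmax(theta)[a_sampled])  # True
      ```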

    2. If there is discounting (γ < 1) it should be treated as a form of termination, which can be done simply by including a factor of γ^t in the second term of

      termination, because discounting by γ is equivalent to an undiscounted problem in which the episode terminates with probability 1 − γ at each step (i.e., continues with probability γ)
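
      A quick check of that equivalence (a standard argument, not quoted from the annotated text): if an undiscounted process continues past each step, independently of the rewards, with probability γ, so that it survives beyond step t with probability γ^t, then the expected undiscounted return is

      $$\mathbb{E}\left[\sum_{t=0}^{T-1} R_{t+1}\right] \;=\; \sum_{t=0}^{\infty} \Pr(T > t)\,\mathbb{E}[R_{t+1}] \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}[R_{t+1}],$$

      which is exactly the γ-discounted return, so terminating with probability 1 − γ per step reproduces the discounted objective.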

  8. Oct 2020
    1. Cheng et al. [92] design a multi-channel parts-aggregated deep convolutional network by integrating the local body part features and the global full-body features in a triplet training framework

      TODO: read this and find out what the philosophy behind parts-based models is??
