52 Matching Annotations
  1. Jun 2022
    1. [highlighted text unrecoverable]

      epsilon greedy

    2. [highlighted text unrecoverable]

      the crux

    3. [highlighted text unrecoverable]

      so rewards, and hence reward functions, can be defined over S x A x S

  2. May 2022
    1. Will they make exactly the same action selections and weight updates?

      no — in Q-learning the greedy action in the Bellman target is taken with respect to Q before the update, but the next step's action is then generated from the updated Q.

      whereas in SARSA with a greedy policy, the next action is chosen greedily before the update, used in the update target, and then also taken to generate the next state (see the sketch below)
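
      A minimal sketch of where the two orderings diverge, assuming a tabular Q array and hypothetical per-step arguments (s, a, r, s_next); alpha and gamma are the usual step size and discount:

      ```python
      import numpy as np

      # Minimal sketch (not the book's pseudocode verbatim): Q is a tabular
      # n_states x n_actions array; s, a, r, s_next come from one environment step.

      def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
          # Target uses the greedy action under the CURRENT Q ...
          target = r + gamma * np.max(Q[s_next])
          Q[s, a] += alpha * (target - Q[s, a])
          # ... but the action to execute next is chosen from the UPDATED Q.
          return int(np.argmax(Q[s_next]))

      def sarsa_greedy_step(Q, s, a, r, s_next, alpha, gamma):
          # The next action is committed to BEFORE the update (greedy w.r.t. old Q) ...
          a_next = int(np.argmax(Q[s_next]))
          target = r + gamma * Q[s_next, a_next]
          Q[s, a] += alpha * (target - Q[s, a])
          # ... and that same action is both used in the target and executed.
          return a_next
      ```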

    2. Q-learning considered an off-policy control method?

      because the policy whose value Q estimates is the one that is greedy w.r.t. Q, while the policy generating the samples can be anything

    3. ρ_t (R_{t+1} + γ G_{t+1:h})

      see 5.9, this is per-decision importance sampling
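
      A rough sketch of the recursion behind this term; rewards, rhos and per_decision_return are hypothetical names, and the control-variate term from the book's full expression is left out:

      ```python
      # rhos[t] = pi(A_t|S_t) / b(A_t|S_t), the per-step importance ratio.
      def per_decision_return(rewards, rhos, gamma):
          G = 0.0
          for r, rho in zip(reversed(rewards), reversed(rhos)):
              # G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}):
              # each reward is scaled only by the ratios of the decisions before it.
              G = rho * (r + gamma * G)
          return G
      ```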

  3. Nov 2021
    1. Information Criterion

      usually a function of the likelihood function plus some penalty for model complexity
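
      For concreteness, a sketch of the two most common instances (AIC and BIC), assuming the maximised log-likelihood, parameter count and sample size are already available:

      ```python
      import numpy as np

      def aic(loglik, k):
          return 2 * k - 2 * loglik          # fixed penalty of 2 per parameter

      def bic(loglik, k, n):
          return k * np.log(n) - 2 * loglik  # penalty grows with the sample size
      ```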

    2. sum of t independent innovations

    3. 1 + θ_1² + θ_2² + ··· + θ_q²

      due to having uncorrelated innovations
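
      A quick simulation check of why uncorrelated innovations give this variance factor; the MA(3) coefficients below are made up:

      ```python
      import numpy as np

      # Check Var(X_t) = sigma^2 * (1 + theta_1^2 + ... + theta_q^2) for
      # X_t = eps_t + theta_1 eps_{t-1} + ... + theta_q eps_{t-q}, i.i.d. innovations.
      rng = np.random.default_rng(0)
      theta = np.array([0.6, -0.3, 0.2])
      sigma = 1.5
      eps = rng.normal(0.0, sigma, size=1_000_000)

      x = eps.copy()
      for j, th in enumerate(theta, start=1):
          x[j:] += th * eps[:-j]

      print(x[len(theta):].var())                  # empirical variance
      print(sigma**2 * (1 + np.sum(theta**2)))     # sigma^2 * (1 + sum theta_j^2)
      ```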

    4. complex number λ: |λ| > 1, (1 − λ⁻¹L)⁻¹

      applying the inverse renders the process an infinite-order MA process

    5. {z : |z| ≤ 1}, i.e., |λ_j| > 1

      geometric series
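
      A sketch tying the last two highlights together: with |λ| > 1 the geometric series in the lag operator inverts (1 − λ⁻¹L), turning an AR(1) into an infinite-order MA. The numbers are illustrative:

      ```python
      import numpy as np

      # phi = 1/lambda has |phi| < 1, so (1 - phi L)^{-1} = sum_{j>=0} phi^j L^j.
      # Applying it to (1 - phi L) x_t = eps_t gives x_t = sum_j phi^j eps_{t-j}.
      rng = np.random.default_rng(1)
      lam, n, J = 2.5, 400, 60
      phi = 1.0 / lam
      eps = rng.normal(size=n)

      x = np.zeros(n)                      # AR(1) generated by recursion
      for t in range(1, n):
          x[t] = phi * x[t - 1] + eps[t]

      # Truncated MA(infinity) representation of the same process
      x_ma = np.array([sum(phi**j * eps[t - j] for j in range(min(J, t) + 1))
                       for t in range(n)])

      print(np.max(np.abs(x[J:] - x_ma[J:])))   # difference is only truncation error
      ```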

    6. L_i(η_t)

      a function of eta_t

    7. ηt−

      how does this affect the process over time?

      e.g. the effect of changing interest rates on the long-term economy

    8. p/n → 0

      the number of parameters grows negligibly relative to the sample size, i.e. far more data than parameters

    9. St

      S_t is a weighted average of white noise

  4. Jul 2021
    1. θ ← θ + α^θ I δ ∇ln π(A|S, θ)

      actor-critic with state value baseline update, with discounting!

      ∇ ln π(A|S, theta) is actually the (negative) gradient of a cross-entropy loss with the taken action as the target (see the sketch below)
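
      A sketch for a softmax policy over linear action preferences (names and shapes here are assumptions, not the book's code), showing that ∇ ln π(a|s, θ) is the negative gradient of a cross-entropy loss with the sampled action as a one-hot target:

      ```python
      import numpy as np

      def softmax(h):
          z = np.exp(h - h.max())
          return z / z.sum()

      def grad_log_pi(theta, x, a):
          # theta: (n_actions, n_features); x: feature vector of S; a: sampled action.
          pi = softmax(theta @ x)                  # pi(.|s, theta)
          one_hot = np.zeros_like(pi)
          one_hot[a] = 1.0
          # d ln pi(a|s) / d preferences = one_hot - pi, i.e. minus the gradient of
          # CE(one_hot, pi); the chain rule through h = theta @ x gives the outer product.
          return np.outer(one_hot - pi, x)

      # The highlighted actor update then reads (I = gamma^t, delta = TD error):
      # theta += alpha_theta * I * delta * grad_log_pi(theta, x, a)
      ```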

  5. May 2021
    1. (δ_t + θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t)

      modified TD error in terms of regular TD error

    2. α φ_t δ′_t

      A_{t+1}^t = I

    3. A^{t−1}_0 θ_0 + α ∑_{i=0}^{t−1} A^{t−1}_{i+1} φ_i G^{λ|t}_i

      we've achieved something special here: a recursive definition of theta_{t+1} in terms of theta_t

    4. = (I − α φ_t φ_t^⊤) θ_t + α ∑_{i=0}^{t−1} A^t_{i+1} φ_i (γλ)^{t−i} δ′_t + α φ_t (R_{t+1} + γ θ_t^⊤ φ_{t+1})

      this is already computationally quite nice, but these jokers want to incorporate the last term into the modified TD error

    5. G^{λ|t+1}

      lambda return = lambda-weighted sum of all n-step returns up to time t+1

    6. γλ e_{t−1} + φ_t − αγλ (e_{t−1}^⊤ φ_t) φ_t

      the dutch trace update rule (see the sketch below)
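
      A sketch of how the dutch trace sits inside one true online TD(λ) step (w, z, x roughly play the roles of θ, e, φ; the function signature is an assumption):

      ```python
      import numpy as np

      def true_online_td_step(w, z, v_old, x, r, x_next, alpha, gamma, lam):
          v, v_next = w @ x, w @ x_next
          delta = r + gamma * v_next - v
          # dutch trace: z <- gamma*lam*z + x - alpha*gamma*lam*(z.x)*x
          z = gamma * lam * z + x - alpha * gamma * lam * (z @ x) * x
          # weight update; the extra (v - v_old) terms are what make it "true online"
          w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
          return w, z, v_next          # v_next becomes the next step's v_old
      ```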

    1. G^λ_{1:3}

      lambda-weighted sum of all available n-step returns starting from t = 1

    2. changing the values of those past states for when they occur again in the future

      assuming the special linear case, with a discrete state space where the feature vector is a one-hot encoding, the TD error multiplied by the eligibility vector gives the effect of the current TD error on each state, with that effect amplified (or attenuated) by each state's recency of occurrence (sketched after the next note)

    3. assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time.

      the current TD error contributes less and less to those states that occurred further back in time (the linear case helps intuition here, since the eligibility trace is a sum of past, fading state feature vectors).
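
      A sketch of the backward view in the special case these notes assume: discrete states with one-hot features, so the weight vector is just a value table and the accumulating trace is one number per state. The env.reset()/env.step() interface is hypothetical:

      ```python
      import numpy as np

      def td_lambda_episode(env, w, alpha, gamma, lam):
          z = np.zeros_like(w)                 # eligibility trace, one entry per state
          s, done = env.reset(), False
          while not done:
              s_next, r, done = env.step()     # hypothetical: the policy is implicit
              z *= gamma * lam                 # older visits fade geometrically ...
              z[s] += 1.0                      # ... the current state gets full credit
              delta = r - w[s] + (0.0 if done else gamma * w[s_next])
              w += alpha * delta * z           # delta reaches each past state scaled
              s = s_next                       # by that state's recency of visitation
          return w
      ```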

    4. λ^{T−t−1} G_t

      collects the weight of all remaining n-step returns, which all equal the full return G_t once they reach termination

    5. Q_{t+n−1}(S_t, A_t)

      should this be inside the bracket??

    6. How about the change in left-side outcome from 0 to −1 made in the larger walk? Do you think that made any difference in the best value of n

      yes, because rewards can propagate in from both sides now, making the optimal n shorter?

    7. G_{t:t+n} ≐ R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})

      G_{t:t+n} needs all rewards from time t+1 through t+n

    8. Q_{t+n}(S_t, A_t) ≐ Q_{t+n−1}(S_t, A_t) + α[G_{t:t+n} − Q_{t+n−1}(S_t, A_t)]

      at the current time step t+n, we update using the state and action from time t (i.e. n steps back); see the sketch below
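
      A sketch of that timing, loosely following the shape of the book's n-step Sarsa pseudocode; the buffer and variable names are assumptions:

      ```python
      def n_step_sarsa_update(Q, states, actions, rewards, tau, n, T, alpha, gamma):
          # states/actions/rewards are hypothetical buffers indexed by time step;
          # T is the episode's terminal time (or a large number if not yet reached).
          G = sum(gamma ** (i - tau - 1) * rewards[i]
                  for i in range(tau + 1, min(tau + n, T) + 1))
          if tau + n < T:
              # bootstrap with the MOST RECENT Q, i.e. Q_{tau+n-1} in the book's indexing
              G += gamma ** n * Q[states[tau + n], actions[tau + n]]
          Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
          return Q
      ```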

    9. t+n−1

      use the most up-to-date version

    10. expectation of being in a state depends only on the policy and the MDP transition probabilities

      hmm, surely the starting-state distribution can have an impact?

  6. Jan 2021
    1. [21, 39] directly use conventional CNN or deep belief networks (DBN)

      interesting, read!

  7. Dec 2020
    1. α G_t ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)

      notice that the multiplier of the gradient here, G_t / π(A_t|S_t), is positive whenever the return is positive, so the update keeps pushing the taken action's probability up. Using a baseline, G_t − v̂(S_t), lets this direction reverse when G_t falls below the baseline (see the sketch below)
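
      A minimal sketch of that sign behaviour; grad_ln_pi stands in for the policy-gradient vector at (S_t, A_t) and is not the book's notation:

      ```python
      def reinforce_step(theta, grad_ln_pi, G_t, alpha, v_hat=0.0):
          # Without a baseline (v_hat = 0) the coefficient is G_t itself, so for
          # tasks with positive returns the taken action is always reinforced.
          coeff = G_t - v_hat
          return theta + alpha * coeff * grad_ln_pi   # coeff < 0 reverses the push
      ```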

    2. Actor–Critic with Eligibility Traces (continuing), for estimating π_θ ≈ π_*

      actor-critic algorithm, one-step TD

    3. (13.16)

      very similar to box 199 but without h(s)

    4. w ← w + α^w δ ∇v̂(S, w)

      TD(0) update

    5. G_{t:t+1} − v̂(S_t, w)

      same as REINFORCE MC baseline, but with the sampled G replaced with a bootstrapped G

    6. That is, w is a single component, w.

      constant baseline?

    7. If there is discounting (γ < 1) it should be treated as a form of termination, which can be done simply by including a factor of γ in the second term of

      termination because discounting by gamma is equivalent to an undiscounted problem in which each step terminates with probability 1 − gamma (i.e. continues with probability gamma); see the simulation below
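
      A quick simulation of that equivalence, using the corrected termination probability 1 − γ; the reward sequence is made up:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      gamma = 0.9
      rewards = np.array([1.0, 2.0, 0.5, 3.0, 1.5])

      discounted = np.sum(gamma ** np.arange(len(rewards)) * rewards)

      total, n_runs = 0.0, 200_000
      for _ in range(n_runs):
          for t, r in enumerate(rewards):
              if t > 0 and rng.random() > gamma:   # terminate with probability 1 - gamma
                  break
              total += r                           # undiscounted reward while "alive"

      print(discounted, total / n_runs)            # the two agree in expectation
      ```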

  8. Oct 2020
    1. Cheng et al. [92] design a multi-channel parts-aggregated deep convolutional network by integrating the local body part features and the global full-body features in a triplet training framework

      TODO: read this and find out what the philosophy behind parts-based models is

    2. adaptive average pooling

      what is this?

    3. Generation/Augmentation

      TODO: read

    4. Using the annotated source data in the training process of the target domain is beneficial for cross-dataset learning

      What? Clarify

    5. Dynamic graph matching (DGM)

      super interesting, but hardly applicable. do read though!

    6. Sample Rate Learning

      what

    7. Singular Vector Decomposition (SVDNet)

      seems interesting, "iteratively integrate the orthogonality constraint in CNN training"

    8. Omni-Scale Network (OSNet)

      read paper again to see if any good ideas for architecture

    9. bottleneck layer

      Bottleneck layers do a 1x1 convolution to reduce the dimensionality before the 3x3 convolution (and a final 1x1 convolution expands the channels back), which saves computation; see the sketch below.

      https://medium.com/@erikgaas/resnet-torchvision-bottlenecks-and-layers-not-as-they-seem-145620f93096
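
      A sketch of such a block in PyTorch (channel sizes are illustrative, not the exact ResNet configuration):

      ```python
      import torch.nn as nn

      class Bottleneck(nn.Module):
          def __init__(self, in_ch=256, mid_ch=64):
              super().__init__()
              self.block = nn.Sequential(
                  nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),   # 1x1: shrink channels
                  nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
                  nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False),  # cheap 3x3
                  nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
                  nn.Conv2d(mid_ch, in_ch, kernel_size=1, bias=False),   # 1x1: expand back
                  nn.BatchNorm2d(in_ch),
              )
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x):
              return self.relu(self.block(x) + x)   # residual connection
      ```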

    10. Global Feature Representation Learning

      something that came up whilst looking through papers on attention: https://arxiv.org/pdf/1709.01507.pdf squeeze-and-excitation

    11. [68]

      Parts-based paper, interesting approach
