 Jun 2022

people.eecs.berkeley.edu

ε-greedy: with probability ε take a random action, otherwise the greedy one

the crux

r(s, a, s′)
so rewards, and reward functions, can be defined over S × A × S

 May 2022

d3c33hcgiwev3.cloudfront.net

Will they make exactly the same action selections and weight updates?
no: in Q-learning, the greedy action in the Bellman target is taken before the update, but the next step's action is generated from the updated Q;
whereas in SARSA with a greedy policy, the same greedy action is used in the update equation and is also the one taken to generate the next step
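The distinction in this note can be sketched in code. A minimal toy comparison (hypothetical single-state MDP, tabular Q as a NumPy array, greedy behaviour for both methods; all names are mine):

```python
import numpy as np

def greedy(Q, s):
    return int(np.argmax(Q[s]))

def q_learning_step(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    # Q-learning: the max over next actions is taken BEFORE the update...
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    # ...and the next action is then chosen from the UPDATED Q
    return greedy(Q, s2)

def sarsa_step(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    # SARSA with a greedy policy: pick the next action first...
    a2 = greedy(Q, s2)
    # ...use that same action in the update, and also execute it
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
    return a2
```

When s2 is the state just updated, the two can pick different next actions even after an identical weight update, which is exactly the asymmetry the note describes.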

Why is Q-learning considered an off-policy control method?
because the policy for which Q estimates values is the one that is greedy w.r.t. Q, but the policy generating the samples can be anything

ρ_t (R_{t+1} + γ G_{t+1:h})
see §5.9; this is per-decision importance sampling

 Nov 2021

ocw.mit.edu

Information Criterion
usually a function of the likelihood plus some penalty for model complexity
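As a concrete, generic instance of this: a sketch of the two most common criteria, AIC and BIC, computed from a maximized log-likelihood (`log_lik`, `k`, and `n` are placeholder inputs, not anything from the lecture):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln(L-hat); penalty of 2 per parameter."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L-hat); penalty grows with n."""
    return k * math.log(n) - 2 * log_lik
```

Lower is better for both; once n exceeds e² ≈ 7.4, BIC penalizes each extra parameter more heavily than AIC does.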

tσ²
the variance of a sum of t independent innovations

1 + θ₁² + θ₂² + ··· + θ_q²
due to having uncorrelated innovations
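A quick simulation check of this variance factor for a hypothetical MA(2) with θ₁ = 0.5, θ₂ = 0.3 (so the factor is 1 + 0.25 + 0.09 = 1.34; the coefficients are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3])   # hypothetical MA(2) coefficients
sigma, n = 1.0, 200_000

eps = rng.normal(0.0, sigma, n + 2)            # white-noise innovations
# X_t = eps_t + theta_1 * eps_{t-1} + theta_2 * eps_{t-2}
x = eps[2:] + theta[0] * eps[1:-1] + theta[1] * eps[:-2]

analytic = sigma**2 * (1 + (theta**2).sum())   # sigma^2 (1 + th1^2 + th2^2)
```

Because the innovations are uncorrelated, all cross terms vanish in expectation, so the sample variance of `x` sits close to `analytic`.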

complex number λ: |λ| > 1, (1 − (1/λ)L)⁻¹
applying the inverse renders the process an infinite-order MA process

|z| ≤ 1}, i.e., |λ_j| > 1
geometric series
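Numerically, the geometric-series inversion says an AR(1) whose root λ lies outside the unit circle (here a hypothetical λ = 2, so φ = 1/λ = 0.5) is the same process as the truncated MA(∞) expansion Σⱼ φʲ ε_{t−j}:

```python
import numpy as np

lam = 2.0            # |lambda| > 1, so |1/lambda| < 1 and the series converges
phi = 1.0 / lam
rng = np.random.default_rng(1)
eps = rng.normal(size=500)

# AR(1): (1 - phi*L) x_t = eps_t, solved by direct recursion
x = np.zeros_like(eps)
x[0] = eps[0]
for t in range(1, len(eps)):
    x[t] = phi * x[t - 1] + eps[t]

# MA(infinity) via the geometric series (1 - phi*L)^-1 = sum_j phi^j L^j, truncated at J
J = 60
x_ma = sum(phi**j * np.concatenate([np.zeros(j), eps[: len(eps) - j]])
           for j in range(J))
```

The two series agree to machine precision, since the dropped tail terms are of size φ^J = 2⁻⁶⁰.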

Li (ηt )
a function of η_t

ηt−
how does this affect the process over time?
e.g., the effect of changing interest rates on the long-term economy

p/n → 0
the number of parameters grows slower than the sample size

St
S_t is a weighted average of white noise

 Jul 2021

d3c33hcgiwev3.cloudfront.net

θ ← θ + α^θ I δ ∇ln π(A|S, θ)
the actor-critic update with a state-value baseline, with discounting!
∇ln π(A|S, θ) is (minus) the gradient of the cross-entropy loss between π(·|S, θ) and the one-hot chosen action A
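A minimal sketch of that update for a hypothetical linear-softmax policy and linear critic (names and shapes are my own, not from the book; δ is the TD error and I accumulates the discount):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, x_s, x_s2, a, r, I,
                      gamma=0.9, alpha_theta=0.1, alpha_w=0.1):
    """One-step actor-critic update with a state-value baseline and discounting."""
    delta = r + gamma * (w @ x_s2) - (w @ x_s)   # TD error
    w = w + alpha_w * delta * x_s                # semi-gradient TD(0) critic update
    # grad of ln pi(a|s,theta) for linear-softmax: (one-hot(a) - pi) outer x_s,
    # i.e. the negative cross-entropy gradient w.r.t. the logits
    pi = softmax(theta @ x_s)
    grad_ln_pi = -np.outer(pi, x_s)
    grad_ln_pi[a] += x_s
    theta = theta + alpha_theta * I * delta * grad_ln_pi
    return theta, w, I * gamma                   # I picks up one more factor of gamma
```
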

 May 2021

jmlr.org

(δ_t + θ_t⊤φ_t − θ_{t−1}⊤φ_t)
the modified TD error in terms of the regular TD error

α φ_t δ′_t
A_{t+1}^t = I

A_0^{t−1} θ_0 + α ∑_{i=0}^{t−1} A_{i+1}^{t−1} φ_i G_i^{λ,t}
we've achieved something special here: the recursive definition of θ_{t+1} in terms of θ_t

= (I − α φ_t φ_t⊤) θ_t + α ∑_{i=0}^{t−1} A_{i+1}^{t} φ_i (γλ)^{t−i} δ′_t + α φ_t (R_{t+1} + γ θ_t⊤ φ_{t+1})
this is already computationally quite nice, but these jokers want to incorporate the last term into the modified TD error

G_{t+1}^λ
the λ-return: a λ-weighted sum of all n-step returns starting from time t+1

γλ e_{t−1} + φ_t − αγλ (e_{t−1}⊤ φ_t) φ_t
the Dutch trace update rule
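The rule above as a one-line helper (my own names; note that with α = 0 it collapses to the familiar accumulating trace γλe_{t−1} + φ_t):

```python
import numpy as np

def dutch_trace(e_prev, phi, alpha, gamma, lam):
    """Dutch-trace update: e_t = gl*e_{t-1} + phi_t - a*gl*(e_{t-1}.phi_t)*phi_t,
    where gl = gamma*lambda. The alpha-dependent term is what distinguishes
    it from the plain accumulating trace."""
    gl = gamma * lam
    return gl * e_prev + phi - alpha * gl * (e_prev @ phi) * phi
```
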


d3c33hcgiwev3.cloudfront.net

G_{1:3}
the truncated λ-return: a weighted sum of all available n-step returns from t = 1

changing the values of those past states for when they occur again in the future
assuming the special linear case, with a discrete state space where the feature vector is a one-hot encoding, the TD error multiplied by the eligibility vector gives the effect of the current TD error on each state, with the effect amplified (or de-amplified) by each state's recency of occurrence.

assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time.
the current TD error contributes less and less to states that occurred further back in time (the linear case helps here, where the eligibility trace is a sum of past, fading state input vectors).

λ^{T−t−1} G_t
the weight collecting all remaining n-step returns

Q_{t+n−1}(S_t, A_t)
should this be inside the bracket??

How about the change in left-side outcome from 0 to 1 made in the larger walk? Do you think that made any difference in the best value of n?
yes, because rewards can now propagate in from both sides, making the optimal n shorter?

G_{t:t+n} ≐ R_{t+1} + γR_{t+2} + ··· + γ^{n−1}R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})
G_{t:t+n} needs all rewards from time t+1 through t+n

Q_{t+n}(S_t, A_t) ≐ Q_{t+n−1}(S_t, A_t) + α[G_{t:t+n} − Q_{t+n−1}(S_t, A_t)]
at the current time step t+n, we need the state and action from time t (i.e., n steps back)

Q_{t+n−1}
use the most up-to-date version
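The two equations above, as small helpers (names are mine; `rewards` holds R_{t+1}..R_{t+n} and `q_tail` is the bootstrap value Q_{t+n−1}(S_{t+n}, A_{t+n})):

```python
def n_step_return(rewards, q_tail, gamma):
    """G_{t:t+n} = R_{t+1} + g*R_{t+2} + ... + g^{n-1}*R_{t+n} + g^n * Q(S_{t+n}, A_{t+n})."""
    n = len(rewards)
    G = sum(gamma**i * r for i, r in enumerate(rewards))
    return G + gamma**n * q_tail

def n_step_sarsa_update(q_sa, G, alpha):
    """Q(S_t, A_t) <- Q(S_t, A_t) + alpha * [G_{t:t+n} - Q(S_t, A_t)]."""
    return q_sa + alpha * (G - q_sa)
```
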

the expectation of being in a state depends only on the policy and the MDP transition probabilities
hmm, surely the starting states can have an impact too?

 Jan 2021

arxiv.org

[21, 39] directly use conventional CNN or deep belief networks (DBN)
interesting, read!


d3c33hcgiwev3.cloudfront.net

If τ+n < T, then: G ← G + γ^n V(S_{τ+n})
V(terminal state) = 0

 Dec 2020

d3c33hcgiwev3.cloudfront.net

α G_t ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)
notice that the multiplier of the gradient here, G_t / π(A_t|S_t), is positive, meaning we always move in the same direction as the gradient; using a baseline, G_t − v(S_t), allows us to reverse this direction when G_t is lower than the baseline
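A two-line numeric check of this observation (illustrative numbers only):

```python
def reinforce_multiplier(G, pi_a, baseline=0.0):
    """Scalar multiplying the policy gradient direction: (G - b) / pi(a|s)."""
    return (G - baseline) / pi_a

up = reinforce_multiplier(2.0, 0.5)                  # no baseline: always > 0
down = reinforce_multiplier(2.0, 0.5, baseline=5.0)  # below-baseline return flips sign
```

Without a baseline every sampled action has its probability pushed up (just by varying amounts); subtracting v(S_t) lets below-par returns push the probability down instead.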

Actor–Critic with Eligibility Traces (continuing), for estimating π_θ ≈ π_*
the one-step TD actor-critic algorithm

(13.16)
very similar to box 199 but without h(s)

w ← w + α^w δ ∇v̂(S, w)
TD(0) update

G_{t:t+1} − v̂(S_t, w)
same as the REINFORCE MC baseline, but with the sampled G replaced by a bootstrapped one-step return

That is, w is a single component, w.
constant baseline?

If there is discounting (γ < 1) it should be treated as a form of termination, which can be done simply by including a factor of γ in the second term of
termination because discounting by γ is equivalent to the undiscounted case with continuation probability γ, i.e., termination probability 1 − γ

 Oct 2020

Local file

Cheng et al. [92] design a multi-channel parts-aggregated deep convolutional network by integrating the local body-part features and the global full-body features in a triplet training framework
TODO: read this and find out what the philosophy behind parts-based models is

adaptive average pooling
what is this?
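To answer my own question with a sketch: adaptive average pooling produces a fixed output length regardless of input size, by averaging over roughly equal bins. A pure-NumPy 1-D version (the floor/ceil bin rule below matches my understanding of PyTorch's behaviour, so treat it as an assumption):

```python
import numpy as np

def adaptive_avg_pool1d(x, out_size):
    """Average-pool x (any length) down to exactly out_size values."""
    n = len(x)
    out = np.empty(out_size)
    for i in range(out_size):
        start = (i * n) // out_size              # floor(i * n / out_size)
        end = -((-(i + 1) * n) // out_size)      # ceil((i+1) * n / out_size)
        out[i] = x[start:end].mean()
    return out
```

This is what lets a CNN backbone accept variable input resolutions while still feeding a fixed-size vector to the classifier head.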

Generation/Augmentation
TODO: read

Using the annotated source data in the training process of the target domain is beneficial for cross-dataset learning
What? Clarify

Dynamic graph matching (DGM)
super interesting, but hardly applicable. do read though!

Sample Rate Learning
what

Singular Vector Decomposition (SVDNet)
seems interesting, "iteratively integrate the orthogonality constraint in CNN training"

Omni-Scale Network (OSNet)
read paper again to see if any good ideas for architecture

bottleneck layer
Bottleneck layers do a 1×1 convolution to reduce the dimensionality before a 3×3 convolution, to save computation
https://medium.com/@erikgaas/resnettorchvisionbottlenecksandlayersnotastheyseem145620f93096
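The savings are easy to count. A sketch with hypothetical channel widths (256 in/out, bottleneck width 64, biases ignored):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution layer (bias ignored)."""
    return c_in * c_out * k * k

c, r = 256, 64                                # full and reduced channel widths
direct = conv_params(c, c, 3)                 # one plain 3x3 conv at full width
bottleneck = (conv_params(c, r, 1)            # 1x1 reduce
              + conv_params(r, r, 3)          # cheap 3x3 at the low width
              + conv_params(r, c, 1))         # 1x1 expand back
```

589,824 weights versus 69,632: the 3×3 work happens at a quarter of the width, so that part costs 1/16 as much.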

Global Feature Representation Learning
something that came up whilst looking through papers on attention: https://arxiv.org/pdf/1709.01507.pdf squeeze-and-excitation

[68]
Parts-based paper, interesting approach
