epsilon greedy
the crux
so rewards, and reward functions, can be defined over S x A x S, i.e. as r(s, a, s')
Will they make exactly the same action selections and weight updates?
no, in Q-learning the greedy action in the Bellman-style target is taken before the update, but the next step's action is generated from the updated Q,
whereas in SARSA with a greedy policy, the same greedy action is used in the update target, and it is also the action taken to generate the next state
Q-learning considered an off-policy control method?
because the policy whose value Q estimates is the one that is greedy w.r.t. Q, while the policy generating the samples can be anything
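A minimal sketch of the difference noted above (my own toy code, not the book's pseudocode; Q is assumed to be a dict or array of numpy arrays, Q[s][a] -> value, and both agents act greedily):

```python
import numpy as np

def greedy(Q, s):
    return int(np.argmax(Q[s]))

def q_learning_step(Q, s, a, r, s2, alpha, gamma):
    # the greedy/max action in the target is evaluated *before* the update ...
    target = r + gamma * np.max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
    # ... but the action actually taken next comes from the *updated* Q
    return greedy(Q, s2)

def sarsa_greedy_step(Q, s, a, r, s2, alpha, gamma):
    # SARSA picks A' first, from the pre-update Q ...
    a2 = greedy(Q, s2)
    # ... and that same A' is used in the target *and* executed next
    target = r + gamma * Q[s2][a2]
    Q[s][a] += alpha * (target - Q[s][a])
    return a2
```

When S' happens to equal S the two can diverge: Q-learning's next action is chosen after Q(S, A) has changed.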
$\rho_t \, (R_{t+1} + \gamma \, G_{t+1:h})$
see 5.9, this is per-decision importance sampling
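For my own reference, as I understand per-decision importance sampling (notation mine): with behaviour policy $b$ and target policy $\pi$,

$\rho_t \doteq \dfrac{\pi(A_t|S_t)}{b(A_t|S_t)}, \qquad G_{t:h} = \rho_t\big(R_{t+1} + \gamma\, G_{t+1:h}\big),$

so each reward $R_{k+1}$ ends up weighted only by the ratios $\rho_t \cdots \rho_k$ up to its own time step, rather than by the full product of ratios out to the horizon.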
Information Criterion
usually a function of the maximized likelihood plus some penalty for model complexity
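For reference, the two common ones (with $\hat{L}$ the maximized likelihood, $k$ the number of parameters, $n$ the sample size):

$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}$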
$t\sigma^2$
sum of t independent innovations
$1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2$
due to having uncorrelated innovations
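Quick numerical sanity check of that variance formula for an MA(2) (sketch; the theta values and sigma below are arbitrary choices):

```python
import numpy as np

# Simulate X_t = eps_t + theta1*eps_{t-1} + theta2*eps_{t-2} and check
# Var(X_t) = sigma^2 * (1 + theta1^2 + theta2^2), which holds because the
# innovations are uncorrelated (all cross terms vanish in expectation).
rng = np.random.default_rng(0)
theta = np.array([0.6, -0.3])
sigma = 1.5
n = 1_000_000

eps = rng.normal(0.0, sigma, size=n)
x = eps.copy()
x[1:] += theta[0] * eps[:-1]
x[2:] += theta[1] * eps[:-2]

print(x.var())                            # empirical variance
print(sigma**2 * (1 + np.sum(theta**2)))  # sigma^2 * (1 + theta1^2 + theta2^2)
```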
complex number $\lambda$: $|\lambda| > 1$, $(1 - \frac{1}{\lambda}L)^{-1}$
applying the inverse renders the process an infinite-order MA process
$\{z : |z| \leq 1\}$, i.e., $|\lambda_j| > 1$
geometric series
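Spelling the step out (assuming $|\lambda| > 1$, so $|1/\lambda| < 1$ and the series converges):

$\Big(1 - \tfrac{1}{\lambda}L\Big)^{-1} = \sum_{k=0}^{\infty} \Big(\tfrac{L}{\lambda}\Big)^{k} = 1 + \tfrac{1}{\lambda}L + \tfrac{1}{\lambda^2}L^2 + \cdots$

and since $L^k \eta_t = \eta_{t-k}$, applying this to the innovations gives an infinite-order MA with geometrically decaying coefficients.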
$L^i(\eta_t)$
a function of eta_t
$\eta_{t-i}$
how does this affect the process over time?
e.g. the effect of changing interest rates on the long-term economy
$p/n \to 0$
more data than parameters
St
S_t is a weighted average of white noise
$\theta \leftarrow \theta + \alpha^{\theta} I \,\delta\, \nabla \ln \pi(A|S,\theta)$
actor-critic with state value baseline update, with discounting!
del ln pi(A|S, theta) is actually the cross-entropy (negative log-likelihood) gradient for the sampled action
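A sketch of the whole one-step update as I read it (my names; grad_log_pi, v_hat and grad_v_hat are hypothetical callables for the policy log-gradient and the value function):

```python
def actor_critic_step(theta, w, I, s, a, r, s2, done,
                      grad_log_pi, v_hat, grad_v_hat,
                      alpha_theta, alpha_w, gamma):
    """One step of one-step actor-critic with a state-value baseline (sketch).

    I is the accumulated discount gamma^t; it scales the actor update so that
    discounting is handled as in the episodic pseudocode.
    """
    # TD(0) error: bootstrapped target minus the baseline v_hat(s, w)
    target = r if done else r + gamma * v_hat(s2, w)
    delta = target - v_hat(s, w)

    # critic: semi-gradient TD(0) update of the value weights
    w = w + alpha_w * delta * grad_v_hat(s, w)

    # actor: move theta along delta * grad ln pi(a|s, theta), scaled by I
    theta = theta + alpha_theta * I * delta * grad_log_pi(s, a, theta)

    return theta, w, I * gamma
```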
$(\delta_t + \theta_t^{\top}\phi_t - \theta_{t-1}^{\top}\phi_t)$
modified TD error in terms of regular TD error
$\alpha \phi_t \delta'_t$
A_{t+1}^t = I
$A^{t-1}_{0}\theta_0 + \alpha \sum_{i=0}^{t-1} A^{t-1}_{i+1}\phi_i G_i^{\lambda|t}$
we've achieved something special here: a recursive definition of theta_{t+1} in terms of theta_t
$= (I - \alpha\phi_t\phi_t^{\top})\theta_t + \alpha\sum_{i=0}^{t-1} A^{t}_{i+1}\phi_i(\gamma\lambda)^{t-i}\delta'_t + \alpha\phi_t(R_{t+1} + \gamma\theta_t^{\top}\phi_{t+1})$
this is already computationally quite nice, but these jokers want to incorporate the last term into the modified TD error
$G^{\lambda|t+1}$
lambda return = the lambda-weighted sum of all n-step returns available up to time t+1
$\gamma\lambda e_{t-1} + \phi_t - \alpha\gamma\lambda(e_{t-1}^{\top}\phi_t)\phi_t$
the dutch trace update rule
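Putting the dutch trace into code, the way I understand true online TD(lambda) for the linear case (sketch; x is the feature vector phi, and e and v_old must be reset to zero at the start of each episode):

```python
import numpy as np

def true_online_td_step(theta, e, v_old, x, r, x_next, alpha, gamma, lam):
    """One step of true online TD(lambda), linear function approximation (sketch)."""
    v = theta @ x
    v_next = theta @ x_next
    delta = r + gamma * v_next - v                        # the regular TD error

    # dutch trace: gamma*lam*e + phi_t - alpha*gamma*lam*(e . phi_t) * phi_t
    e = gamma * lam * e + x - alpha * gamma * lam * (e @ x) * x

    # weight update; the (v - v_old) terms are the correction that makes the
    # online algorithm match the forward-view lambda-return exactly
    theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x

    return theta, e, v_next   # v_next becomes v_old on the next step

# usage: e = 0 and v_old = 0 at the start of each episode
theta, e, v_old = np.zeros(4), np.zeros(4), 0.0
theta, e, v_old = true_online_td_step(theta, e, v_old,
                                      x=np.eye(4)[0], r=1.0, x_next=np.eye(4)[1],
                                      alpha=0.1, gamma=0.9, lam=0.8)
```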
$G^{\lambda}_{1:3}$
lambda-weighted sum of all available n-step returns starting from t=1
changing the values of those past states for when they occur again in the future
assuming the special linear case, with a discrete state space and where the feature vector is a one-hot encoding, the TD error multiplied by the eligibility vector is the effect of the current TD error on each state, with the effect amplified (or de-amplified) by each state's recency of occurrence.
assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time.
the current TD error contributes less and less to those states that occur further back in time (using the linear case helps, where the eligibility trace is a sum of past, fading state input vectors).
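A tiny one-hot illustration of that credit assignment (sketch, using plain accumulating traces rather than dutch traces, just to see the fading weights):

```python
import numpy as np

# Tabular TD(lambda) with accumulating traces: the one-hot feature case, where
# each state's trace entry decays by gamma*lam per step since its last visit,
# so the current TD error is assigned backward with geometrically fading weight.
n_states, alpha, gamma, lam = 5, 0.1, 0.9, 0.8
V = np.zeros(n_states)
e = np.zeros(n_states)

visits = [0, 1, 2]          # a short trajectory of state indices
rewards = [0.0, 0.0, 1.0]   # reward received on leaving each state
for t, s in enumerate(visits):
    s_next = visits[t + 1] if t + 1 < len(visits) else None
    v_next = 0.0 if s_next is None else V[s_next]   # V(terminal) = 0
    delta = rewards[t] + gamma * v_next - V[s]

    e *= gamma * lam        # all traces fade...
    e[s] += 1.0             # ...and the current state's trace is bumped

    V += alpha * delta * e  # every previously visited state gets a share of delta
    print(t, round(delta, 3), e.round(3))
```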
$\lambda^{T-t-1} G_t$
sum of all remaining n-step returns
$Q_{t+n-1}(S_t, A_t)$
should this be inside the bracket??
How about the change in left-side outcome from 0 to −1 made in the larger walk? Do you think that made any difference in the best value of n?
yes, because rewards can propagate in from both sides now, making the optimal n shorter?
$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n})$
G(t:t+n) needs all rewards from time t+1 up to t+n
$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\,[\,G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\,]$
at the current time step t+n, we need the states and actions from time = t (i.e. n steps back)
$t+n-1$
use the most up-to-date version
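The two formulas combined in a sketch (my own toy code; Q is assumed to be a dict mapping (state, action) to a value):

```python
def n_step_sarsa_update(Q, transitions, s_boot, a_boot, alpha, gamma):
    """n-step Sarsa update (sketch). `transitions` is the list
    [(S_t, A_t, R_{t+1}), ..., (S_{t+n-1}, A_{t+n-1}, R_{t+n})] of the last n
    steps; (s_boot, a_boot) = (S_{t+n}, A_{t+n})."""
    n = len(transitions)
    s_tau, a_tau, _ = transitions[0]            # the pair updated: n steps back

    # G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n}
    G = sum(gamma**i * r for i, (_, _, r) in enumerate(transitions))
    # ... + gamma^n * Q_{t+n-1}(S_{t+n}, A_{t+n}): bootstrap with the current Q
    G += gamma**n * Q[(s_boot, a_boot)]

    # Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + alpha*[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]
    Q[(s_tau, a_tau)] += alpha * (G - Q[(s_tau, a_tau)])
```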
expectation of being in a state depends only on the policy and the MDP transition probabilities
hmm, surely starting states can have an impact too?
[21, 39] directly use conventional CNN or deep belief networks (DBN)
interesting, read!
If $\tau+n < T$, then: $G \leftarrow G + \gamma^{n} V(S_{\tau+n})$
V(terminal state) = 0
$\alpha G_t \dfrac{\nabla\pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}$
notice that the multiplier of the gradient here, G_t / pi(a|s), is positive (for non-negative returns), meaning we always move in the same direction as the gradient. Using a baseline, G_t - v(S_t), allows us to reverse this direction when G_t is lower than the baseline.
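Sketch of that point (note grad pi / pi = grad ln pi, so the update can be written with the log-gradient; grad_log_pi is a hypothetical callable):

```python
def reinforce_update(theta, G, s, a, grad_log_pi, alpha, baseline=0.0):
    """REINFORCE update for one time step (sketch).

    Without a baseline the coefficient on grad ln pi is G >= 0 (for
    non-negative returns), so probability is always pushed *toward* the
    sampled action. With a baseline (e.g. v_hat(S_t)), the coefficient
    G - baseline goes negative whenever the return falls below the baseline,
    and the update pushes probability *away* from the sampled action instead.
    """
    coeff = G - baseline
    return theta + alpha * coeff * grad_log_pi(s, a, theta)
```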
Actor–Critic with Eligibility Traces (continuing), for estimating $\pi_\theta \approx \pi_*$
actor critic algorithm one step TD:
(13.16)
very similar to box 199 but without h(s)
$w \leftarrow w + \alpha^{w} \,\delta\, \nabla\hat{v}(S,w)$
TD(0) update
$G_{t:t+1} - \hat{v}(S_t, w)$
same as REINFORCE MC baseline, but with the sampled G replaced with a bootstrapped G
That is, w is a single component, w.
constant baseline?
If there is discounting ($\gamma < 1$) it should be treated as a form of termination, which can be done simply by including a factor of $\gamma$ in the second term of
termination, because discounting by gamma is equivalent to an undiscounted problem in which each step terminates with probability 1 - gamma (i.e. continues with probability gamma)
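One way to see it: let the undiscounted process terminate independently with probability $1-\gamma$ after every step. Then, with termination independent of the rewards,

$\mathbb{E}\Big[\sum_{k\ge 0} R_{t+k+1}\,\mathbf{1}\{\text{not yet terminated after } k \text{ steps}\}\Big] = \sum_{k\ge 0} \gamma^{k}\,\mathbb{E}[R_{t+k+1}],$

which is exactly the discounted return, so $\gamma$ acts as a per-step continuation probability.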
Cheng et al. [92] design a multi-channel parts-aggregated deep convolutional network by integrating the local body part features and the global full-body features in a triplet training framework
TODO: read this and find out what the philosophy behind parts-based models is
adaptive average pooling
what is this?
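Partially answering my own question (sketch, assuming PyTorch): adaptive average pooling fixes the output spatial size and derives the pooling regions from the input size, so differently sized feature maps all come out the same shape.

```python
import torch
import torch.nn as nn

# AdaptiveAvgPool2d takes a target output size instead of a kernel size, so
# inputs with different spatial dimensions produce the same output shape.
pool = nn.AdaptiveAvgPool2d((1, 1))     # global average pooling per channel

x_small = torch.randn(8, 256, 12, 4)    # e.g. a re-ID feature map
x_large = torch.randn(8, 256, 24, 8)
print(pool(x_small).shape)  # torch.Size([8, 256, 1, 1])
print(pool(x_large).shape)  # torch.Size([8, 256, 1, 1])
```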
Generation/Augmentation
TODO: read
Using the annotated source data in the training process of the target domain is beneficial for cross-dataset learning
What? Clarify
Dynamic graph matching (DGM)
super interesting, but hardly applicable. do read though!
Sample Rate Learning
what
Singular Vector Decomposition (SVDNet)
seems interesting, "iteratively integrate the orthogonality constraint in CNN training"
Omni-Scale Network (OSNet)
read paper again to see if any good ideas for architecture
bottleneck layer
Bottleneck layers do a 1x1 convolution to reduce the channel dimensionality before the 3x3 convolution (followed by a 1x1 to expand back), to save computation
https://medium.com/@erikgaas/resnet-torchvision-bottlenecks-and-layers-not-as-they-seem-145620f93096
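Rough sketch of the block described above (PyTorch, simplified: no batch norm, stride, or channel-changing projection on the skip path):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand + skip."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)        # cheap 1x1: shrink channels
        self.conv3 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)   # expensive 3x3 on fewer channels
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)        # 1x1: restore channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv3(out))
        out = self.expand(out)
        return self.relu(out + x)   # residual connection

# usage
block = Bottleneck(256)
y = block(torch.randn(1, 256, 32, 16))
print(y.shape)  # torch.Size([1, 256, 32, 16])
```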
Global Feature Representation Learning
something that came up whilst looking through papers on attention: https://arxiv.org/pdf/1709.01507.pdf squeeze-and-excitation
[68]
Parts-based paper, interesting approach