Summary of policy gradients using backprop (last ten minutes):
The 'policy' is defined as the probability of taking an action given the history of observations so far: \(p_\theta(a_t | h_t)\), with \(h_t = (o_0, o_1, ..., o_t)\) and \(\theta\) the parameters (e.g. network weights) to be learned.
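As a concrete (hypothetical) illustration, here is a minimal sketch of such a policy as a small PyTorch network, assuming a discrete action space and that the history \(h_t\) has already been summarized into a fixed-size feature vector; the names `history_dim`, `n_actions`, and `hidden` are made up for the example:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a fixed-size encoding of the history h_t to action probabilities p_theta(a_t | h_t)."""
    def __init__(self, history_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, h):
        # Logits -> probabilities over actions; theta = the weights of self.net.
        return torch.softmax(self.net(h), dim=-1)
```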
Each action yields a reward \(r_t(a_t)\); the expected return (with the expectation taken over action sequences sampled from the policy) is:
$$ J(\theta) = E_{a_t \sim p_\theta(\cdot \mid h_t)}\left[ \sum_{t=0}^{T} r_t(a_t) \right] $$
The gradient of the expected return with respect to the parameters \(\theta\) (= "which direction should I change \(\theta\)?") is estimated as follows:
Sample (& average over) many action sequences (= play many games), and for each sequence \((a_0, a_1, ..., a_T)\) compute:
---- the sum, over each action \(a_t\) in the sequence, of:
---- ---- [the direction to change \(\theta\) that makes action \(a_t\) more (log-)probable given history \(h_t\), i.e. \(\nabla_\theta \log p_\theta(a_t \mid h_t)\)] * [the total reward obtained from this action and all subsequent ones, \(\sum_{t'=t}^{T} r_{t'}(a_{t'})\)] (written out as an equation below).
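Written out, the prose above corresponds to the standard score-function (REINFORCE) estimator, averaging over \(N\) sampled sequences indexed by \(n\):
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log p_\theta\!\left(a_t^{(n)} \mid h_t^{(n)}\right) \left( \sum_{t'=t}^{T} r_{t'}\!\left(a_{t'}^{(n)}\right) \right) $$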
The whole thing can be read as: if an action in a game was followed by high rewards, try to take that action more often the next time you are in the same situation.
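Putting the pieces together, here is a minimal sketch of one such update implemented with backprop in PyTorch. It assumes the hypothetical `Policy` module sketched above and an environment object whose `reset()`/`step()` methods return a fixed-size history encoding, a reward, and a done flag; this interface and all names are assumptions made for the example, not something from the original notes:

```python
import torch

def policy_gradient_update(policy, optimizer, env, n_episodes=16):
    """One REINFORCE update: sample episodes, weight log-probs by reward-to-go, backprop."""
    losses = []
    for _ in range(n_episodes):
        h = env.reset()                      # fixed-size encoding of the history so far (assumed interface)
        log_probs, rewards = [], []
        done = False
        while not done:
            probs = policy(torch.as_tensor(h, dtype=torch.float32))
            dist = torch.distributions.Categorical(probs)
            a = dist.sample()                # a_t ~ p_theta(. | h_t)
            log_probs.append(dist.log_prob(a))
            h, r, done = env.step(a.item())  # assumed environment interface
            rewards.append(r)
        # Reward-to-go: total reward from each action and all subsequent actions.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running += r
            returns.append(running)
        returns.reverse()
        # Negative sign: we maximize J(theta) by minimizing -J(theta).
        losses.append(-(torch.stack(log_probs) * torch.tensor(returns)).sum())
    loss = torch.stack(losses).mean()        # average over the sampled sequences
    optimizer.zero_grad()
    loss.backward()                          # backprop produces the policy-gradient estimate
    optimizer.step()
```

In use, one would pair this with something like `torch.optim.Adam(policy.parameters())` and call it repeatedly until the average return stops improving.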