Proximal Policy Optimization (PPO)

Applying the proximal framework to policy optimization, with importance sampling to reweight samples from the data-collection policy, gives the update:

[ \theta^{k+1} = \arg\max_{\theta} \Big\{ \E_{\mathcal D} \Big[ \frac{\pi^\theta(a|s)}{\pi^{\text{data}}(a|s)} \, r(a,s) \Big] - \beta\, D(\pi^\theta \Vert \pi^{\theta^k}) \Big\}. ]

The penalty term is usually the KL divergence:

[ D(\pi^\theta \Vert \pi^{\theta^k}) = \E_{s\sim p_0} \big[ \mathrm{KL}(\pi^\theta(\cdot|s) \Vert \pi^{\theta^k}(\cdot|s)) \big]. ]
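Below is a minimal sketch in PyTorch of how this penalized objective could be estimated from a batch of off-policy samples. It assumes a discrete action space, that the behavior log-probabilities \log \pi^{\text{data}}(a|s) were recorded at collection time, and that the per-state policy distributions are available; the function and argument names (ppo_kl_penalty_loss, logp_data, etc.) are illustrative, not a reference implementation.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def ppo_kl_penalty_loss(pi_theta, pi_theta_k, logp_data, actions, reward, beta=0.1):
    """KL-penalized PPO surrogate loss (illustrative sketch).

    pi_theta   : Categorical pi_theta(.|s) for each sampled state (current parameters)
    pi_theta_k : Categorical pi_theta_k(.|s) for each sampled state (previous iterate, no grad)
    logp_data  : log pi_data(a|s) recorded by the data-collection policy
    actions    : sampled actions a
    reward     : r(a, s) for each sampled (s, a) pair
    beta       : penalty coefficient
    """
    # Importance-sampling ratio pi_theta(a|s) / pi_data(a|s).
    ratio = torch.exp(pi_theta.log_prob(actions) - logp_data)

    # Importance-weighted reward term of the objective.
    surrogate = (ratio * reward).mean()

    # KL(pi_theta(.|s) || pi_theta_k(.|s)), averaged over the states in the batch
    # (a sample-based stand-in for the expectation over s ~ p_0 above).
    kl = kl_divergence(pi_theta, pi_theta_k).mean()

    # The update maximizes surrogate - beta * KL; return its negation for a minimizer.
    return -(surrogate - beta * kl)
```

Calling this loss inside a standard optimizer loop and taking gradient steps on theta corresponds to (approximately) solving the arg-max above for one iteration k.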