Problem Setup
For simplicity, let us first consider a one-step decision-making process (also known as a contextual bandit).

Let $s$ be the state vector of the environment, describing the current situation or configuration the agent observes (for example, the position of a robot, or the market condition in trading). Let $a$ be the action taken by the agent, i.e., a decision or move that affects the environment (e.g., moving left/right, buying/selling). The reward function, denoted by $r(s, a)$, assigns a numerical score to each state–action pair, indicating how desirable the action is when taken in that state.
The agent’s behavior is characterized by a policy, represented as a conditional probability distribution $\pi(a \mid s)$, which specifies the probability of selecting each possible action $a$ given the current state $s$.
The objective is to learn an optimal policy that maximizes the expected reward under the environment’s state distribution. Formally, we define the performance of a policy $\pi$ as
$$J(\pi) = \mathbb{E}_{s \sim p(s)}\Big[\, \mathbb{E}_{a \sim \pi(a \mid s)}\big[ r(s, a) \big] \,\Big],$$
which can equivalently be written as a joint expectation:
$$J(\pi) = \mathbb{E}_{(s, a) \sim p_\pi(s, a)}\big[ r(s, a) \big],$$
where $p(s)$ denotes the (unknown) distribution of states in the environment, and
$$p_\pi(s, a) = p(s)\, \pi(a \mid s)$$
is the joint distribution of states and actions induced by the policy $\pi$.
In practice, the state distribution $p(s)$ is rarely known in closed form; instead, we have access to a finite set of samples $\{s_i\}_{i=1}^N$ drawn from $p(s)$, typically obtained from interaction with the environment or an existing dataset.
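As a concrete illustration, here is a minimal Monte Carlo sketch of estimating $J(\pi)$ from sampled state–action pairs; the tabular reward table `r`, policy `pi`, and uniform state distribution are hypothetical stand-ins for the unknown environment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular contextual bandit: reward table r[s, a] and a fixed policy pi[s, a].
n_states, n_actions = 4, 3
r = rng.normal(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # each row sums to 1

# Exact J(pi) under an (assumed) uniform state distribution p(s).
p_s = np.full(n_states, 1.0 / n_states)
J_exact = np.sum(p_s[:, None] * pi * r)

# Monte Carlo estimate from sampled (s, a) pairs, which is all we have in practice.
N = 50_000
s = rng.choice(n_states, size=N, p=p_s)
a = np.array([rng.choice(n_actions, p=pi[si]) for si in s])
J_mc = r[s, a].mean()

print(J_exact, J_mc)   # the two agree up to Monte Carlo noise
```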
Policy Gradient
In practice, the policy is parameterized by a differentiable function $\pi_\theta(a \mid s)$ with parameters $\theta$. The gradient of the expected reward is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim p_{\pi_\theta}}\big[\, r(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big].$$
(This uses the log-derivative trick: $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$.)
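For reference, the full derivation is short (it assumes the gradient can be moved inside the expectation over states and actions):
$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \int p(s)\, \pi_\theta(a \mid s)\, r(s, a)\; \mathrm{d}s\, \mathrm{d}a \\
&= \int p(s)\, \nabla_\theta \pi_\theta(a \mid s)\, r(s, a)\; \mathrm{d}s\, \mathrm{d}a \\
&= \int p(s)\, \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a)\; \mathrm{d}s\, \mathrm{d}a \\
&= \mathbb{E}_{(s, a) \sim p_{\pi_\theta}}\big[\, r(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big].
\end{aligned}
$$
Note that $p(s)$ does not depend on $\theta$, so only the policy factor is differentiated.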
Reward Baseline
For any policy $\pi_\theta$ and any state $s$:
$$\mathbb{E}_{a \sim \pi_\theta(a \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \big] = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0.$$
Hence,
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim p_{\pi_\theta}}\big[\, \big( r(s, a) - b(s) \big)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big],$$
where $b(s)$ is any function independent of $a$ (a baseline).
Choosing $b(s) = \mathbb{E}_{a' \sim \pi_\theta(a' \mid s)}[\, r(s, a') \,]$ gives the advantage function:
$$A(s, a) = r(s, a) - \mathbb{E}_{a' \sim \pi_\theta(a' \mid s)}\big[ r(s, a') \big].$$
A positive $A(s, a)$ means the action performs above average under the current policy.
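A quick numerical check on a hypothetical single-state softmax bandit (all names below are illustrative): both estimators target the same gradient, and subtracting the expected reward as a baseline typically shrinks the spread when rewards share a large common offset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-state bandit with a softmax policy pi_theta(a) = softmax(theta).
n_actions = 3
theta = rng.normal(size=n_actions)
r = np.array([1.0, 2.0, 4.0])                 # reward for each action
pi = np.exp(theta - theta.max()); pi /= pi.sum()

def grad_log_pi(a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - pi
    g = -pi.copy()
    g[a] += 1.0
    return g

baseline = pi @ r                             # b = E_pi[r(a)]

def grad_estimate(n, use_baseline):
    a = rng.choice(n_actions, size=n, p=pi)
    w = r[a] - baseline if use_baseline else r[a]
    return np.mean([wi * grad_log_pi(ai) for wi, ai in zip(w, a)], axis=0)

reps = 2_000
plain = np.stack([grad_estimate(10, False) for _ in range(reps)])
based = np.stack([grad_estimate(10, True) for _ in range(reps)])
print("means:     ", plain.mean(0), based.mean(0))            # roughly equal (both unbiased)
print("variances: ", plain.var(0).sum(), based.var(0).sum())  # typically smaller with baseline
```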
On-Policy Estimation
Given data $\{(s_i, a_i)\}_{i=1}^N$ with $s_i \sim p(s)$ and $a_i \sim \pi_\theta(a \mid s_i)$,
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i),$$
with $A_i = A(s_i, a_i)$.
Gradient ascent update (REINFORCE):
$$\theta \leftarrow \theta + \eta\, \frac{1}{N} \sum_{i=1}^N A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i),$$
where $\eta$ is the learning rate.
Intuitively:
- $A_i > 0$: increase $\log \pi_\theta(a_i \mid s_i)$;
- $A_i < 0$: decrease it.
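Below is a minimal REINFORCE sketch on a hypothetical two-state, three-action bandit with a tabular softmax policy; the reward table and hyperparameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical contextual bandit: the best action differs per state.
r = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
n_states, n_actions = r.shape
theta = np.zeros((n_states, n_actions))       # tabular softmax policy parameters
eta, batch, steps = 0.5, 64, 200

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

for _ in range(steps):
    pi = softmax(theta)
    s = rng.integers(n_states, size=batch)
    a = np.array([rng.choice(n_actions, p=pi[si]) for si in s])
    # Advantage: observed reward minus the per-state expected reward under the current policy.
    A = r[s, a] - (pi[s] * r[s]).sum(axis=1)
    # Accumulate A_i * grad log pi(a_i | s_i); for a softmax, grad log pi = one_hot(a) - pi(s).
    grad = np.zeros_like(theta)
    for si, ai, Ai in zip(s, a, A):
        g = -pi[si].copy()
        g[ai] += 1.0
        grad[si] += Ai * g
    theta += eta * grad / batch               # gradient ascent on J

print(np.round(softmax(theta), 2))            # most mass ends up on the best action per state
```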
Connection to Weighted MLE
If the data are held fixed, the policy gradient equals the gradient of the advantage-weighted log-likelihood:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N A_i\, \log \pi_\theta(a_i \mid s_i), \qquad \nabla_\theta J(\theta) \approx \nabla_\theta \mathcal{L}(\theta).$$
Thus, policy optimization ≈ adaptive MLE with dynamic weights $A_i$, which change as the policy (and hence the sampled data and advantages) changes.
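This is also how the update is usually implemented with automatic differentiation: build the advantage-weighted log-likelihood as a surrogate loss and differentiate it. A minimal PyTorch-style sketch with made-up batch tensors and a linear policy head:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical fixed batch: state features, sampled actions, and precomputed advantages.
states = torch.randn(32, 4)
actions = torch.randint(0, 3, (32,))
advantages = torch.randn(32)

policy = torch.nn.Linear(4, 3)                # produces the logits of pi_theta(a | s)

log_probs = Categorical(logits=policy(states)).log_prob(actions)
# Advantage-weighted log-likelihood; its gradient is the policy-gradient estimate for this batch.
surrogate = (advantages.detach() * log_probs).mean()
(-surrogate).backward()                       # minimize the negation, i.e. ascend the surrogate
print(policy.weight.grad.shape)
```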
Off-Policy Estimation (Importance Sampling)
If data are drawn from a behavior policy $\mu(a \mid s)$:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim p(s),\, a \sim \mu(a \mid s)}\!\left[ \frac{\pi_\theta(a \mid s)}{\mu(a \mid s)}\, A(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right].$$
Empirically,
$$\nabla_\theta J(\theta) \approx \sum_{i=1}^N \tilde{w}_i\, A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i),$$
with normalization $\tilde{w}_i = \dfrac{w_i}{\sum_{j=1}^N w_j}$, where $w_i = \dfrac{\pi_\theta(a_i \mid s_i)}{\mu(a_i \mid s_i)}$ are the importance weights (self-normalized importance sampling).
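A small numerical sketch of the self-normalized weighting, here used to estimate the expected reward itself rather than its gradient; the single-state behavior policy `mu`, target policy `pi`, and rewards are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions = 4

# Hypothetical single-state example: data collected under mu, evaluated for pi.
mu = np.array([0.4, 0.3, 0.2, 0.1])           # behavior policy
pi = np.array([0.1, 0.2, 0.3, 0.4])           # target policy pi_theta
r = np.array([0.0, 1.0, 2.0, 3.0])

a = rng.choice(n_actions, size=10_000, p=mu)  # actions sampled from mu
w = pi[a] / mu[a]                             # importance weights
w_norm = w / w.sum()                          # self-normalized weights

J_off_policy = np.sum(w_norm * r[a])          # off-policy estimate of E_pi[r]
print(J_off_policy, pi @ r)                   # close to the exact on-policy value
```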
Proximal Policy Optimization (PPO)
Proximal Point Method
Starting from the current iterate $\theta_k$, iterate:
$$\theta_{k+1} = \arg\max_{\theta}\; \Big[ J(\theta) - \lambda\, D(\theta, \theta_k) \Big],$$
where $D(\theta, \theta_k)$ measures the deviation between parameters (e.g., squared distance or KL divergence between the induced policies) and $\lambda > 0$ controls the penalty strength.
This forms the basis for proximal methods.
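To see the implicit character of this update, take the squared-distance penalty $D(\theta, \theta_k) = \tfrac{1}{2} \lVert \theta - \theta_k \rVert^2$. The first-order optimality condition of the inner maximization gives
$$\nabla_\theta J(\theta_{k+1}) - \lambda\, (\theta_{k+1} - \theta_k) = 0 \quad\Longrightarrow\quad \theta_{k+1} = \theta_k + \frac{1}{\lambda}\, \nabla_\theta J(\theta_{k+1}),$$
i.e., a gradient-ascent step whose gradient is evaluated at the new iterate rather than the current one. This implicit step is stable for large step sizes but requires solving an inner problem, which motivates the approximations below.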
Proximal Gradient & Preconditioning
Approximate $J(\theta)$ by its first-order Taylor expansion around $\theta_k$:
$$J(\theta) \approx J(\theta_k) + \nabla_\theta J(\theta_k)^\top (\theta - \theta_k).$$
If $D(\theta, \theta_k) = \tfrac{1}{2} (\theta - \theta_k)^\top M\, (\theta - \theta_k)$ for a positive-definite matrix $M$,
then the update is:
$$\theta_{k+1} = \theta_k + \frac{1}{\lambda}\, M^{-1} \nabla_\theta J(\theta_k),$$
known as preconditioned gradient descent (here taken as an ascent step on $J$). Expanding a KL-divergence penalty to second order yields $M$ equal to the Fisher information matrix of the policy, one common choice of preconditioner.
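A toy numerical illustration of why the preconditioner matters, on a hypothetical ill-conditioned quadratic objective; choosing $M$ equal to the curvature matrix is the idealized case, and all constants below are illustrative.

```python
import numpy as np

# Hypothetical objective J(theta) = -0.5 * theta^T A theta + b^T theta (maximized at A^{-1} b).
A = np.diag([100.0, 1.0])                     # ill-conditioned curvature
b = np.array([1.0, 1.0])
grad = lambda th: b - A @ th

lam = 1.0
M_inv = np.linalg.inv(A)                      # preconditioner M chosen as the curvature matrix

theta_plain, theta_prec = np.zeros(2), np.zeros(2)
for _ in range(50):
    theta_plain += 0.01 * grad(theta_plain)           # step size limited by the largest curvature
    theta_prec += (1.0 / lam) * M_inv @ grad(theta_prec)

theta_star = np.linalg.solve(A, b)
print(np.linalg.norm(theta_plain - theta_star),       # still far along the flat direction
      np.linalg.norm(theta_prec - theta_star))        # essentially converged
```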
Remark. The proximal point framework interpolates between fast-updating methods (gradient descent) and slower, more implicit schemes, depending on how the reference point $\theta_k$ evolves.
- reference set to the current iterate at every step: proximal gradient descent
- reference held fixed while several optimization steps are taken: inner-loop updates
- reference set to an EMA of past iterates → smoother, more stable updates
Proximal Policy Optimization (PPO)
Applying the proximal framework to policy optimization with importance sampling:
$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{(s, a) \sim p_{\pi_{\theta_k}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A(s, a) \right] - \lambda\, D(\theta, \theta_k).$$
The penalty term is usually the KL divergence between the old and new policies:
$$D(\theta, \theta_k) = \mathbb{E}_{s \sim p(s)}\Big[ D_{\mathrm{KL}}\big( \pi_{\theta_k}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \Big].$$
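A minimal PyTorch-style sketch of the KL-penalized surrogate, assuming a categorical policy, a frozen snapshot `policy_old` that collected the batch, and made-up batch tensors; the coefficient `beta` plays the role of $\lambda$.

```python
import torch
from torch.distributions import Categorical, kl_divergence

torch.manual_seed(0)

# Hypothetical batch collected under the old policy pi_{theta_k}.
states = torch.randn(32, 4)
actions = torch.randint(0, 3, (32,))
advantages = torch.randn(32)

policy = torch.nn.Linear(4, 3)                # current policy pi_theta
policy_old = torch.nn.Linear(4, 3)            # frozen snapshot pi_{theta_k}
policy_old.load_state_dict(policy.state_dict())
for p in policy_old.parameters():
    p.requires_grad_(False)

beta = 0.1                                    # KL penalty coefficient (the lambda above)
dist = Categorical(logits=policy(states))
dist_old = Categorical(logits=policy_old(states))
ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))   # pi_theta / pi_{theta_k}
penalized = (ratio * advantages).mean() - beta * kl_divergence(dist_old, dist).mean()
(-penalized).backward()                       # ascend the penalized surrogate
```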
Clipping Objective
PPO further introduces a clipping function in place of the explicit penalty:
$$J^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{(s, a) \sim p_{\pi_{\theta_k}}}\Big[ \min\big( w_\theta(s, a)\, A(s, a),\ \operatorname{clip}\big( w_\theta(s, a),\, 1 - \epsilon,\, 1 + \epsilon \big)\, A(s, a) \big) \Big],$$
where $w_\theta(s, a) = \dfrac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}$ and
$$\operatorname{clip}(w,\, 1 - \epsilon,\, 1 + \epsilon) = \min\big( \max(w,\, 1 - \epsilon),\, 1 + \epsilon \big).$$
Different choices of penalty and clipping function yield different stability–efficiency tradeoffs.
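For concreteness, a minimal sketch of the clipped surrogate as a loss function; the function name, arguments, and the default $\epsilon = 0.2$ are illustrative, not a fixed recipe.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Negative clipped surrogate for a batch of (state, action) samples."""
    ratio = torch.exp(log_prob_new - log_prob_old.detach())        # w_theta(s, a)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum ignores gains from pushing the ratio outside [1 - eps, 1 + eps]
    # while still registering losses, making the objective a pessimistic surrogate.
    return -torch.min(unclipped, clipped).mean()
```

Here `log_prob_old` comes from the frozen policy $\pi_{\theta_k}$ that generated the batch, so it carries no gradient; only the current policy's log-probabilities are differentiated.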