Problem Setup
For simplicity, let us first consider a one-step decision-making process (also known as a contextual bandit).

We define the following notions:
- Let $s$ be the state vector of the environment, describing the current situation or configuration the agent observes (for example, the position of a robot, or the market condition in trading).
- Let $a$ be the action taken by the agent, i.e., a decision or move that affects the environment (e.g., moving left/right, buying/selling).
- Let $r(s, a)$ denote the reward function, which assigns a numerical score to each state–action pair, indicating how desirable the action is when taken in that state.
The agent’s behavior is characterized by a policy, represented as a conditional probability distribution $\pi(a \mid s)$, which specifies the probability of selecting each possible action $a$ given the current state $s$.
The objective is to learn an optimal policy that maximizes the expected reward under the environment’s state distribution. Formally, we define the performance of a policy $\pi$ as
$$J(\pi) = \mathbb{E}_{s \sim p(s)}\big[\, \mathbb{E}_{a \sim \pi(a \mid s)}[\, r(s, a)\,] \,\big],$$
which can equivalently be written as a joint expectation:
$$J(\pi) = \mathbb{E}_{(s, a) \sim p(s)\,\pi(a \mid s)}[\, r(s, a)\,],$$
where $p(s)$ denotes the (unknown) distribution of states in the environment, and
$$p(s, a) = p(s)\,\pi(a \mid s)$$
is the joint distribution of states and actions induced by the policy $\pi$.
In practice, the state distribution $p(s)$ is rarely known in closed form; instead, we have access to a finite set of samples $\{s_i\}_{i=1}^{N}$ drawn from $p(s)$, typically obtained from interaction with the environment or an existing dataset.
So far, we have the optimization problem
$$\max_{\pi}\; J(\pi) = \mathbb{E}_{(s, a) \sim p(s)\,\pi(a \mid s)}[\, r(s, a)\,].$$
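As a concrete illustration, here is a minimal Monte Carlo sketch of estimating $J(\pi)$ from sampled states. The toy state distribution, reward, and policy below are invented for this example, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextual bandit (illustrative choices): scalar states, two actions,
# and a reward that prefers action 1 whenever s > 0.
def sample_state():
    return rng.normal()

def reward(s, a):
    return 1.0 if (a == 1) == (s > 0) else 0.0

def policy(s):
    # A fixed stochastic policy pi(a|s): probability of action 1 given s.
    p1 = 1.0 / (1.0 + np.exp(-2.0 * s))
    return rng.binomial(1, p1)

# Monte Carlo estimate of J(pi) = E_{s ~ p(s), a ~ pi(a|s)}[r(s, a)]
# from a finite set of sampled states, as described above.
N = 10_000
samples = [reward(s, policy(s)) for s in (sample_state() for _ in range(N))]
print("estimated J(pi):", np.mean(samples))
```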
Policy Gradient
In practice, the policy is parameterized by a differentiable function $\pi_\theta(a \mid s)$ with parameters $\theta$. The gradient of the expected reward is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim p(s)\,\pi_\theta(a \mid s)}\big[\, r(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big].$$
(Used the log-derivative trick: $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$.)
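For completeness, the gradient can be expanded step by step; the only assumptions are that $\pi_\theta$ is differentiable and that we may exchange differentiation and integration:
$$\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p(s) \int \pi_\theta(a \mid s)\, r(s, a)\, da\, ds \\
  &= \int p(s) \int \nabla_\theta \pi_\theta(a \mid s)\, r(s, a)\, da\, ds \\
  &= \int p(s) \int \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a)\, da\, ds \\
  &= \mathbb{E}_{(s, a) \sim p(s)\,\pi_\theta(a \mid s)}\big[\, r(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big].
\end{aligned}$$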
Reward Baseline
For any policy $\pi_\theta$, we can derive the following identity:
$$\mathbb{E}_{a \sim \pi_\theta(a \mid s)}\big[\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]
  = \int \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, da
  = \nabla_\theta \int \pi_\theta(a \mid s)\, da
  = \nabla_\theta 1 = 0.$$
Then, we have
$$\mathbb{E}_{(s, a) \sim p(s)\,\pi_\theta(a \mid s)}\big[\, b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big] = 0.$$
Hence, by the linearity of expectation, we can subtract any function independent of $a$ from the reward without changing the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim p(s)\,\pi_\theta(a \mid s)}\big[\, \big(r(s, a) - b(s)\big)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big],$$
where $b(s)$ is any function independent of $a$.
Choosing $b(s) = \mathbb{E}_{a' \sim \pi_\theta(a' \mid s)}[\, r(s, a')\,]$ gives the advantage function:
$$A(s, a) = r(s, a) - \mathbb{E}_{a' \sim \pi_\theta(a' \mid s)}[\, r(s, a')\,].$$
A positive $A(s, a)$ means the action performs above average.
So far, we have the policy gradient expressed as
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim p(s)\,\pi_\theta(a \mid s)}\big[\, A(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big].$$
On-Policy Estimation
Given data $\{(s_i, a_i)\}_{i=1}^{N}$ with $s_i \sim p(s)$ and $a_i \sim \pi_\theta(a \mid s_i)$,
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i),$$
with $A_i = r(s_i, a_i) - b(s_i)$.
Gradient ascent update (REINFORCE):
$$\theta \leftarrow \theta + \eta\, \frac{1}{N} \sum_{i=1}^{N} A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i).$$
Intuitively:
- $A_i > 0$: increase $\log \pi_\theta(a_i \mid s_i)$;
- $A_i < 0$: decrease it.
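A minimal numerical sketch of this update, assuming a linear-softmax policy over two actions and a batch-mean baseline; the feature map, reward, and learning rate are illustrative choices, not prescribed above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 2, 3
theta = np.zeros((n_actions, dim))           # policy parameters

def features(s):
    return np.array([1.0, s, s ** 2])        # hypothetical feature map phi(s)

def policy_probs(s):
    logits = theta @ features(s)
    z = np.exp(logits - logits.max())
    return z / z.sum()                        # pi_theta(. | s)

def grad_log_pi(s, a):
    # d/d theta of log pi_theta(a | s) for a linear-softmax policy:
    # (1{k = a} - p_k) * phi(s) for each action row k.
    p = policy_probs(s)
    g = -np.outer(p, features(s))
    g[a] += features(s)
    return g

def reward(s, a):
    return 1.0 if (a == 1) == (s > 0) else 0.0

eta, N = 0.1, 64
for step in range(200):
    samples, rewards = [], []
    for _ in range(N):
        s = rng.normal()
        a = rng.choice(n_actions, p=policy_probs(s))
        samples.append((s, a))
        rewards.append(reward(s, a))
    baseline = np.mean(rewards)               # simple baseline b(s): batch mean reward
    grad = np.zeros_like(theta)
    for (s, a), r in zip(samples, rewards):
        grad += (r - baseline) * grad_log_pi(s, a)
    theta += eta * grad / N                   # REINFORCE gradient ascent step
```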
Connection to Weighted MLE
If the data $\{(s_i, a_i)\}_{i=1}^{N}$ and the weights $A_i$ are fixed, the policy gradient equals the gradient of the $A_i$-weighted log-likelihood.
Let’s consider the following log-likelihood function:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} A_i\, \log \pi_\theta(a_i \mid s_i),
\qquad
\nabla_\theta \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i).$$
Thus, policy optimization ≈ adaptive MLE with dynamic weights $A_i$.
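A short PyTorch sketch of this view, treating the advantages as fixed weights on a log-likelihood term; the tensors below are placeholders for what a real policy network and sampled batch would provide:

```python
import torch
import torch.nn.functional as F

# Placeholder batch: logits of pi_theta over 4 actions, sampled actions, and advantages A_i.
logits = torch.randn(64, 4, requires_grad=True)
actions = torch.randint(0, 4, (64,))
advantages = torch.randn(64)

log_pi = F.log_softmax(logits, dim=-1)
log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi_theta(a_i | s_i)

# Advantage-weighted negative log-likelihood: minimizing it follows the policy gradient.
# The weights are detached, i.e., treated as constants, matching the weighted-MLE view.
loss = -(advantages.detach() * log_pi_a).mean()
loss.backward()
```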
Off-Policy Estimation (Importance Sampling)
If the data $\{(s_i, a_i)\}_{i=1}^{N}$ are drawn from a behavior policy $\mu(a \mid s)$ rather than from $\pi_\theta$, using importance sampling, we have:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} w_i\, A_i\, \nabla_\theta \log \pi_\theta(a_i \mid s_i),$$
where the importance weight is
$$w_i = \frac{\pi_\theta(a_i \mid s_i)}{\mu(a_i \mid s_i)},$$
and the normalization factor is often taken as $\sum_{i=1}^{N} w_i$ instead of $N$ (self-normalized importance sampling), which trades a small bias for lower variance.
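A small sketch of the two normalization choices, with placeholder probabilities standing in for $\pi_\theta(a_i \mid s_i)$ and $\mu(a_i \mid s_i)$ evaluated on a batch:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
pi_probs = rng.uniform(0.05, 1.0, N)     # pi_theta(a_i | s_i) on the batch
mu_probs = rng.uniform(0.05, 1.0, N)     # mu(a_i | s_i), the behavior policy
advantages = rng.normal(size=N)          # A_i

w = pi_probs / mu_probs                  # importance weights w_i

# Ordinary IS (normalize by N) vs. self-normalized IS (normalize by sum of w_i).
ordinary = np.sum(w * advantages) / N
self_normalized = np.sum(w * advantages) / np.sum(w)
print(ordinary, self_normalized)
```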
Proximal Policy Optimization (PPO)
Proximal Point Method
Starting from an initial point $\theta_0$, iterate:
$$\theta_{k+1} = \arg\min_{\theta}\; f(\theta) + \beta\, D(\theta, \theta_k),$$
where $D(\theta, \theta_k)$ measures the deviation between parameters (e.g., squared distance or KL divergence), and $\beta > 0$ controls the strength of the penalty.
This forms the basis for proximal methods.
Proximal Gradient Descent
Approximate $f(\theta)$ by its first-order Taylor expansion around the reference point $\theta_k$:
$$f(\theta) \approx f(\theta_k) + \nabla f(\theta_k)^\top (\theta - \theta_k).$$
Then, since $f(\theta_k)$ is independent of the optimization variable $\theta$, the update becomes:
$$\theta_{k+1} = \arg\min_{\theta}\; \nabla f(\theta_k)^\top (\theta - \theta_k) + \beta\, D(\theta, \theta_k).$$
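As a sanity check, with the simplest penalty $D(\theta, \theta_k) = \tfrac{1}{2}\|\theta - \theta_k\|^2$ the surrogate is minimized in closed form (set its gradient to zero) and the update reduces to vanilla gradient descent with step size $1/\beta$:
$$\theta_{k+1}
  = \arg\min_{\theta}\; \nabla f(\theta_k)^\top (\theta - \theta_k) + \frac{\beta}{2}\, \|\theta - \theta_k\|^2
  = \theta_k - \frac{1}{\beta}\, \nabla f(\theta_k).$$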
Preconditioned Gradient Descent
Consider a quadratic proximal penalty:
$$D(\theta, \theta_k) = \tfrac{1}{2}\, (\theta - \theta_k)^\top P\, (\theta - \theta_k),$$
where $P$ is a positive definite matrix chosen by the user.
Then the update has a closed-form solution:
$$\theta_{k+1} = \theta_k - \tfrac{1}{\beta}\, P^{-1}\, \nabla f(\theta_k).$$
This is known as preconditioned gradient descent.
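A toy sketch of this update on an ill-conditioned quadratic; the function, the preconditioner $P$, and $\beta$ are all illustrative choices:

```python
import numpy as np

Q = np.diag([100.0, 1.0])        # f(theta) = 0.5 * theta^T Q theta, ill-conditioned
P = np.diag([100.0, 1.0])        # preconditioner (here matched to the curvature)
beta = 1.0

def grad_f(theta):
    return Q @ theta

theta = np.array([1.0, 1.0])
for _ in range(10):
    # Closed-form minimizer of the linearized objective plus quadratic proximal penalty.
    theta = theta - (1.0 / beta) * np.linalg.solve(P, grad_f(theta))
print(theta)   # converges quickly because P cancels the ill-conditioning
```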
Remark. Assume we use gradient descent to solve the inner loops of proximal methods. The difference from vanilla gradient descent then lies in how the reference point is updated. To see this, let us consider the following generic update rule
$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla_\theta \big[\, f(\theta^{(t)}) + \beta\, D(\theta^{(t)}, \bar{\theta}) \,\big],$$
where $\bar{\theta}$ denotes the reference point.
The proximal point framework interpolates between fast-updating methods (gradient descent) and slower, more implicit schemes, depending on how the reference point evolves.
- $\bar{\theta} = \theta^{(t)}$, refreshed at every step: proximal gradient descent
- $\bar{\theta} = \theta_k$, held fixed for several steps before being refreshed: inner-loop updates
- $\bar{\theta}$ set to an EMA of past iterates → smoother, more stable updates
Proximal Policy Optimization (PPO)
Applying the proximal framework to policy optimization with importance sampling:
$$\theta_{k+1} = \arg\max_{\theta}\;
  \mathbb{E}_{(s, a) \sim p(s)\,\pi_{\theta_k}(a \mid s)}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A(s, a) \right]
  - \beta\, D(\theta, \theta_k).$$
Derivation:
We have the original proximal step, written for maximization:
$$\theta_{k+1} = \arg\max_{\theta}\; J(\theta) - \beta\, D(\theta, \theta_k).$$
From off-policy estimation with behavior policy $\mu = \pi_{\theta_k}$, we have
$$\nabla_\theta J(\theta) \approx
  \mathbb{E}_{(s, a) \sim p(s)\,\pi_{\theta_k}(a \mid s)}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right].$$
Integrating with respect to $\theta$ (using $\frac{\pi_\theta}{\pi_{\theta_k}}\, \nabla_\theta \log \pi_\theta = \nabla_\theta \frac{\pi_\theta}{\pi_{\theta_k}}$), we have, up to a constant independent of $\theta$,
$$J(\theta) \approx
  \mathbb{E}_{(s, a) \sim p(s)\,\pi_{\theta_k}(a \mid s)}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A(s, a) \right].$$
Substituting this surrogate into the original formula gives the final result.
The penalty term is usually the KL divergence:
$$D(\theta, \theta_k) = \mathbb{E}_{s \sim p(s)}\big[\, \mathrm{KL}\big( \pi_{\theta_k}(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s) \big) \,\big].$$
Clipping Objective
PPO further introduces a clipping function:
$$L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_{(s, a) \sim p(s)\,\pi_{\theta_k}(a \mid s)}\Big[
    \min\big(\, w(\theta)\, A(s, a),\;
    \mathrm{clip}\big(w(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, A(s, a) \,\big) \Big],$$
where $w(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}$ and $\mathrm{clip}(x, 1 - \epsilon, 1 + \epsilon)$ truncates $x$ to the interval $[1 - \epsilon,\, 1 + \epsilon]$ for a small constant $\epsilon$ (e.g., $0.2$).
Different choices of penalty and clipping function yield different stability–efficiency tradeoffs.
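A hedged PyTorch sketch of the clipped surrogate for a small categorical policy; the logits, actions, advantages, and clip range below are placeholders standing in for what a policy network and a sampled batch would provide:

```python
import torch
import torch.nn.functional as F

# Placeholder batch for the current policy pi_theta and the frozen pi_{theta_k}.
new_logits = torch.randn(32, 4, requires_grad=True)
old_logits = torch.randn(32, 4)
actions = torch.randint(0, 4, (32,))
advantages = torch.randn(32)
eps = 0.2

log_pi_new = F.log_softmax(new_logits, dim=-1).gather(1, actions[:, None]).squeeze(1)
log_pi_old = F.log_softmax(old_logits, dim=-1).gather(1, actions[:, None]).squeeze(1)
ratio = (log_pi_new - log_pi_old).exp()      # w(theta) = pi_theta / pi_{theta_k}

# min(w * A, clip(w, 1 - eps, 1 + eps) * A); negate to obtain a loss to minimize.
surrogate = torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
loss = -surrogate.mean()
loss.backward()
```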
GRPO
GRPO Basic
In the GRPO case, we replace the advantage function with a group-relative advantage.
For a given state $s$, the policy samples a group of $G$ different actions $a_1, \dots, a_G$.
We define the following notations:
- Let $r_i = r(s, a_i)$, which denotes the reward for action $a_i$.
- Let $\{r_1, \dots, r_G\}$ denote the group reward for state $s$.
- Let $\mathrm{mean}(\{r_1, \dots, r_G\})$ and $\mathrm{std}(\{r_1, \dots, r_G\})$ denote the mean and standard deviation of the group reward.
We define the advantage function for action $a_i$ as follows:
$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}.$$
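A minimal sketch of the group-relative advantage for one state, with placeholder rewards and a small constant added to the denominator for numerical stability (the constant is an implementation convenience, not part of the formula above):

```python
import numpy as np

rewards = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # r_1, ..., r_G for one state
eps = 1e-8                                       # avoids division by zero when all rewards match

advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
print(advantages)   # positive for above-average actions, negative otherwise
```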
GRPO in LLMs
In the LLM setting, we define the following notation (a small sketch tying it to the group-relative advantage follows the list):
- The user query $q$, playing the role of the state.
- The model response $o = (o_1, \dots, o_{|o|})$, a sequence of tokens, playing the role of the action.
- Response probability $\pi_\theta(o \mid q) = \prod_{t=1}^{|o|} \pi_\theta(o_t \mid q, o_{<t})$.
- Response reward $r(q, o)$.
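To connect this notation with the group-relative advantage, here is a hedged sketch for a single query: sample $G$ responses, standardize their rewards within the group, and weight each response's sequence log-probability by its advantage. The tensors are placeholders for what a language model and a reward function would produce, and the full GRPO objective (which also uses PPO-style ratios, clipping, and a KL penalty) is deliberately omitted:

```python
import torch

G = 4
seq_logps = torch.randn(G, requires_grad=True)    # log pi_theta(o_i | q), summed over tokens
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])      # r(q, o_i) for the G sampled responses

# Group-relative advantages: standardize rewards within the group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# REINFORCE-style surrogate with group-relative advantages (maximize => minimize negative).
loss = -(advantages.detach() * seq_logps).mean()
loss.backward()
```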