
Fix/ppo playback, visualization, and remove time penalty #11

Merged
eisDNV merged 6 commits into main from fix/ppo-playback-and-visualization on Apr 30, 2026

Conversation

@aleksandarbabicdnv (Collaborator)

Context

PR #10 introduced reward_fac — a configurable tuple of reward weights, the third of which adds −self.time × 0.001 to the reward each step. This breaks PPO in a fundamental way (see below). This PR also fixes two bugs that made play_ppo.py non-functional, adds a gamma parameter, and improves the episode plot.

Changes

Bug: infinite loop in play_ppo.py (commit 1)

ProximalPolicyOptimizationAgent.load() was missing the TimeLimit wrapper. The converged model produces near-zero reward, so the reward_limit threshold was never crossed and the episode ran forever. Added wrapper_class=TimeLimit, max_episode_steps=3000 to match the training configuration.
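
A minimal sketch of the fix, assuming the agent builds its playback environment with stable-baselines3's make_vec_env; the surrounding class layout and the env factory name are illustrative, not the project's exact code:

```python
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env


class ProximalPolicyOptimizationAgent:
    def load(self, model_path: str) -> None:
        # Cap playback episodes at 3000 steps, matching the training setup
        # in __init__/resume(). Without TimeLimit, a converged policy with
        # near-zero reward never crosses reward_limit and never terminates.
        env = make_vec_env(
            self.make_env,                    # hypothetical env factory
            n_envs=1,
            wrapper_class=TimeLimit,
            wrapper_kwargs={"max_episode_steps": 3000},
        )
        self.model = PPO.load(model_path, env=env)
```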

Feature: configurable gamma parameter (commit 2)

ProximalPolicyOptimizationAgent now accepts a gamma parameter (default 0.99), exposed as --gamma in train_ppo.py. Tested with γ=0.999 — converges slower for this task, since the pendulum control problem is reactive and local; a longer planning horizon adds value function complexity without benefit.
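
A sketch of the plumbing, assuming the agent forwards gamma straight to stable-baselines3's PPO; the constructor signature and the stand-in environment are illustrative:

```python
import argparse

import gymnasium as gym
from stable_baselines3 import PPO


class ProximalPolicyOptimizationAgent:
    def __init__(self, env: gym.Env, gamma: float = 0.99) -> None:
        # gamma is forwarded to stable-baselines3; 0.99 keeps the effective
        # planning horizon around 100 steps, which suits this reactive task.
        self.model = PPO("MlpPolicy", env, gamma=gamma)


if __name__ == "__main__":
    # train_ppo.py exposes the same default on the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gamma", type=float, default=0.99,
                        help="discount factor forwarded to PPO")
    args = parser.parse_args()
    # Stand-in env; the project uses its own crane/pendulum environment.
    agent = ProximalPolicyOptimizationAgent(gym.make("Pendulum-v1"), gamma=args.gamma)
```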

Fix: disable explicit time penalty for PPO — Markov violation (commit 3)

reward_fac[2] adds −self.time × 0.001 each step. self.time grows continuously but is not in the observation, violating the Markov property: two identical crane/pendulum states at different points in an episode produce different rewards. PPO's value function V(s) cannot condition on hidden state, so it is forced to average across the time distribution at each state — producing noisy gradient estimates and poor convergence.

Importantly, time preference is already encoded implicitly through the discount factor γ: V(s) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … so the policy naturally prefers faster solutions without any extra signal. The explicit penalty is both redundant and harmful for function-approximation methods like PPO.
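
For intuition: a unit of reward received k steps later contributes γ^k to V(s), so with the default γ = 0.99 delays are already penalised geometrically:

```python
>>> gamma = 0.99
>>> [round(gamma ** k, 3) for k in (10, 100, 300)]
[0.904, 0.366, 0.049]
```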

Q-learning tolerates it because dt=0.1 makes the penalty 10× smaller, and the tabular value function averages over many revisits to the same bucket, smoothing out the non-stationarity. reward_fac remains on the environment unchanged; only the PPO scripts override reward_fac[2] to 0.0.
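
A minimal sketch of the override in train_ppo.py and play_ppo.py; the environment class name and the first two weights are placeholders, the zeroed third element is the actual change:

```python
# Keep whatever position/velocity weights the environment normally uses,
# but zero the time term so the reward depends only on the observed state.
reward_fac = (1.0, 1.0, 0.0)       # third factor multiplies -self.time * 0.001
env = CranePendulumEnv(reward_fac=reward_fac)   # hypothetical env constructor
```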

Bug: episode plot never appeared (commit 4a)

render() had no handler for render_mode='plot'. show_plot() was only called from reset() at the start of the next episode, so with --episodes 1 the plot was never shown. Added the missing elif branch.
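
A sketch of the added branch, assuming render() dispatches on self.render_mode; the 'human' branch and helper names other than show_plot() are illustrative:

```python
def render(self):
    if self.render_mode == "human":
        self._render_frame()          # existing behaviour (illustrative)
    elif self.render_mode == "plot":
        # Previously missing: surface the episode plot from render() so a
        # single-episode run shows it instead of waiting for the next reset().
        self.show_plot()
```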

Improvement: show_plot visualization (commit 4b)

  • 4×1 vertical layout (was 2×2): all subplots share a common time axis at full width, making it easy to align crane and pendulum events visually.
  • Combined twin-axis legends: ax.legend() only captured primary-axis lines. Load speed, damping, and crane speed (on twin axes) were absent from the legend. Fixed by combining handles from both axes with get_legend_handles_labels() (see the sketch after this list).
  • plt.suptitle() instead of plt.title(): plt.title() attached to the last active axes, placing the text between subplots. suptitle() places it at the figure level.
  • Added suptitle stub to stubs/matplotlib-stubs/pyplot.pyi (pyright did not recognise it).
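
A condensed sketch of the legend and title changes; the axis and variable names are illustrative placeholders, not the project's exact show_plot() code:

```python
import matplotlib.pyplot as plt

# 4x1 layout with a shared time axis, as in the new show_plot().
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(16, 12), sharex=True)
ax1y2 = ax1.twinx()                               # twin axis, e.g. load speed

ax1.plot([0, 1, 2], [0, 1, 0], label="load position")       # placeholder data
ax1y2.plot([0, 1, 2], [1, 0, 1], "r", label="load speed")   # placeholder data

# ax1.legend() alone only sees lines drawn on ax1, so merge the handles
# and labels from the primary and twin axes before building the legend.
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax1y2.get_legend_handles_labels()
ax1.legend(h1 + h2, l1 + l2, loc="upper left")

# plt.title() would attach to the last active axes and land between
# subplots; suptitle() places a single title at the figure level.
fig.suptitle("Episode overview")
plt.show()
```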

Commit messages

Without this wrapper, play_ppo.py ran indefinitely on a converged model.
The trained policy produces near-zero reward, so the reward_limit
threshold was never crossed and no termination signal fired.

TimeLimit caps episodes at max_episode_steps=3000, matching the
training configuration used in __init__ and resume().

Exposes the PPO discount factor as gamma (default 0.99) on
ProximalPolicyOptimizationAgent and as --gamma in train_ppo.py.

Tested with gamma=0.999: converges slower for this task. The pendulum
control problem is reactive and local — a longer planning horizon adds
value function complexity without improving policy quality. gamma=0.99
remains the default.

The reward_fac[2] term adds -self.time * 0.001 to the reward each step.
self.time grows continuously but is not part of the observation, violating
the Markov property: two identical crane/pendulum states at different
episode times produce different rewards. PPO's value function V(s) cannot
condition on hidden state, so it must average across the time distribution
at each state — producing noisy gradient estimates and erratic convergence.

Time preference is already encoded implicitly through the discount factor γ:
V(s) = r_t + γ*r_{t+1} + γ²*r_{t+2} + … so the policy naturally prefers
faster solutions without any extra signal. An explicit time penalty is
both redundant and harmful for function-approximation methods like PPO.

Q-learning tolerates it because dt=0.1 makes the penalty 10× smaller, and
the tabular value function averages over many revisits to the same state
bucket, smoothing out the non-stationarity.

Set reward_fac[2] = 0.0 in both train_ppo.py and play_ppo.py. The
reward_fac parameter remains on the environment so Q-learning can still
use it; only the PPO scripts override it explicitly.

render() had no handler for render_mode='plot'. show_plot() was only
called from reset() at the start of the next episode, so with a single
episode run the plot never appeared.
- Switch from 2×2 grid to 4×1 vertical layout (figsize 16×12): gives
  each subplot full width and a shared time axis, making crane-pendulum
  interactions easier to align visually.
- Combine primary and twin-axis legend handles so load speed, damping,
  and crane speed (on ax1y2/ax2y2) appear in the legend. ax.legend()
  alone only captured primary-axis lines.
- Use plt.suptitle() instead of plt.title() for the figure-level title.
  plt.title() attached to the last active axes (ax4), placing the text
  between subplots. suptitle() places it above all subplots.
- Add loc="upper left" to ax2 legend to avoid overlap with the twin
  y-axis spine on the right.
- Add suptitle stub to stubs/matplotlib-stubs/pyplot.pyi (pyright did
  not recognise plt.suptitle without it); a sketch follows below.
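
The stub addition might look roughly like this; the real matplotlib signature takes more parameters, but a minimal form is enough for pyright here:

```python
# stubs/matplotlib-stubs/pyplot.pyi (sketch)
from matplotlib.text import Text

def suptitle(t: str, **kwargs: object) -> Text: ...
```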

Also updates CHANGELOG.md with all changes from this branch.
@aleksandarbabicdnv (Collaborator, Author)

Here is a "play" episode plot after training PPO for 2M steps.



@eisDNV (Collaborator) left a comment:

Nice and nice results.

@eisDNV merged commit e55b3c4 into main on Apr 30, 2026
20 checks passed
@eisDNV deleted the fix/ppo-playback-and-visualization branch on April 30, 2026 08:39
