
Fix/ppo playback, visualization, and remove time penalty #11

Merged
eisDNV merged 6 commits into main from fix/ppo-playback-and-visualization on Apr 30, 2026

Conversation

@aleksandarbabicdnv (Collaborator)

Context

PR #10 introduced reward_fac — a configurable tuple of reward weights, the third of which adds −self.time × 0.001 to the reward each step. This breaks PPO in a fundamental way (see below). This PR also fixes two bugs that made play_ppo.py non-functional, adds a gamma parameter, and improves the episode plot.

Changes

Bug: infinite loop in play_ppo.py (commit 1)

ProximalPolicyOptimizationAgent.load() was missing the TimeLimit wrapper. The converged model produces near-zero reward, so the reward_limit threshold was never crossed and the episode ran forever. Added wrapper_class=TimeLimit, max_episode_steps=3000 to match the training configuration.
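
A minimal sketch of the fix, assuming the agent builds its playback environment with stable-baselines3's make_vec_env; the surrounding class layout and the env factory name are illustrative, not the project's exact code:

```python
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env


class ProximalPolicyOptimizationAgent:
    def load(self, model_path: str) -> None:
        # Cap playback episodes at 3000 steps, matching the training setup
        # in __init__/resume(). Without TimeLimit, a converged policy with
        # near-zero reward never crosses reward_limit and never terminates.
        env = make_vec_env(
            self.make_env,                    # hypothetical env factory
            n_envs=1,
            wrapper_class=TimeLimit,
            wrapper_kwargs={"max_episode_steps": 3000},
        )
        self.model = PPO.load(model_path, env=env)
```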

Feature: configurable gamma parameter (commit 2)

ProximalPolicyOptimizationAgent now accepts a gamma parameter (default 0.99), exposed as --gamma in train_ppo.py. Tested with γ=0.999 — converges slower for this task, since the pendulum control problem is reactive and local; a longer planning horizon adds value function complexity without benefit.
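
A sketch of the plumbing, assuming the agent forwards gamma straight to stable-baselines3's PPO; the constructor signature and the stand-in environment are illustrative:

```python
import argparse

import gymnasium as gym
from stable_baselines3 import PPO


class ProximalPolicyOptimizationAgent:
    def __init__(self, env: gym.Env, gamma: float = 0.99) -> None:
        # gamma is forwarded to stable-baselines3; 0.99 keeps the effective
        # planning horizon around 100 steps, which suits this reactive task.
        self.model = PPO("MlpPolicy", env, gamma=gamma)


if __name__ == "__main__":
    # train_ppo.py exposes the same default on the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gamma", type=float, default=0.99,
                        help="discount factor forwarded to PPO")
    args = parser.parse_args()
    # Stand-in env; the project uses its own crane/pendulum environment.
    agent = ProximalPolicyOptimizationAgent(gym.make("Pendulum-v1"), gamma=args.gamma)
```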

Fix: disable explicit time penalty for PPO — Markov violation (commit 3)

reward_fac[2] adds −self.time × 0.001 each step. self.time grows continuously but is not in the observation, violating the Markov property: two identical crane/pendulum states at different points in an episode produce different rewards. PPO's value function V(s) cannot condition on hidden state, so it is forced to average across the time distribution at each state — producing noisy gradient estimates and poor convergence.

Importantly, time preference is already encoded implicitly through the discount factor γ: V(s) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … so the policy naturally prefers faster solutions without any extra signal. The explicit penalty is both redundant and harmful for function-approximation methods like PPO.
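
For intuition: a unit of reward received k steps later contributes γ^k to V(s), so with the default γ = 0.99 delays are already penalised geometrically:

```python
>>> gamma = 0.99
>>> [round(gamma ** k, 3) for k in (10, 100, 300)]
[0.904, 0.366, 0.049]
```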

Q-learning tolerates it because dt=0.1 makes the penalty 10× smaller, and the tabular value function averages over many revisits to the same bucket, smoothing out the non-stationarity. reward_fac remains on the environment unchanged; only the PPO scripts override reward_fac[2] to 0.0.
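
A minimal sketch of the override in train_ppo.py and play_ppo.py; the environment class name and the first two weights are placeholders, the zeroed third element is the actual change:

```python
# Keep whatever position/velocity weights the environment normally uses,
# but zero the time term so the reward depends only on the observed state.
reward_fac = (1.0, 1.0, 0.0)       # third factor multiplies -self.time * 0.001
env = CranePendulumEnv(reward_fac=reward_fac)   # hypothetical env constructor
```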

Bug: episode plot never appeared (commit 4a)

render() had no handler for render_mode='plot'. show_plot() was only called from reset() at the start of the next episode, so with --episodes 1 the plot was never shown. Added the missing elif branch.
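
A sketch of the added branch, assuming render() dispatches on self.render_mode; the 'human' branch and helper names other than show_plot() are illustrative:

```python
def render(self):
    if self.render_mode == "human":
        self._render_frame()          # existing behaviour (illustrative)
    elif self.render_mode == "plot":
        # Previously missing: surface the episode plot from render() so a
        # single-episode run shows it instead of waiting for the next reset().
        self.show_plot()
```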

Improvement: show_plot visualization (commit 4b)

  • 4×1 vertical layout (was 2×2): all subplots share a common time axis at full width, making it easy to align crane and pendulum events visually.
  • Combined twin-axis legends: ax.legend() only captured primary-axis lines. Load speed, damping, and crane speed (on twin axes) were absent from the legend. Fixed by combining handles from both axes with get_legend_handles_labels() (see the sketch after this list).
  • plt.suptitle() instead of plt.title(): plt.title() attached to the last active axes, placing the text between subplots. suptitle() places it at the figure level.
  • Added suptitle stub to stubs/matplotlib-stubs/pyplot.pyi (pyright did not recognise it).
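
A condensed sketch of the legend and title changes; the axis and variable names are illustrative placeholders, not the project's exact show_plot() code:

```python
import matplotlib.pyplot as plt

# 4x1 layout with a shared time axis, as in the new show_plot().
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(16, 12), sharex=True)
ax1y2 = ax1.twinx()                               # twin axis, e.g. load speed

ax1.plot([0, 1, 2], [0, 1, 0], label="load position")       # placeholder data
ax1y2.plot([0, 1, 2], [1, 0, 1], "r", label="load speed")   # placeholder data

# ax1.legend() alone only sees lines drawn on ax1, so merge the handles
# and labels from the primary and twin axes before building the legend.
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax1y2.get_legend_handles_labels()
ax1.legend(h1 + h2, l1 + l2, loc="upper left")

# plt.title() would attach to the last active axes and land between
# subplots; suptitle() places a single title at the figure level.
fig.suptitle("Episode overview")
plt.show()
```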

Commit messages

Without this wrapper, play_ppo.py ran indefinitely on a converged model.
The trained policy produces near-zero reward, so the reward_limit
threshold was never crossed and no termination signal fired.

TimeLimit caps episodes at max_episode_steps=3000, matching the
training configuration used in __init__ and resume().

Exposes the PPO discount factor as gamma (default 0.99) on
ProximalPolicyOptimizationAgent and as --gamma in train_ppo.py.

Tested with gamma=0.999: converges slower for this task. The pendulum
control problem is reactive and local — a longer planning horizon adds
value function complexity without improving policy quality. gamma=0.99
remains the default.

The reward_fac[2] term adds -self.time * 0.001 to the reward each step.
self.time grows continuously but is not part of the observation, violating
the Markov property: two identical crane/pendulum states at different
episode times produce different rewards. PPO's value function V(s) cannot
condition on hidden state, so it must average across the time distribution
at each state — producing noisy gradient estimates and erratic convergence.

Time preference is already encoded implicitly through the discount factor γ:
V(s) = r_t + γ*r_{t+1} + γ²*r_{t+2} + … so the policy naturally prefers
faster solutions without any extra signal. An explicit time penalty is
both redundant and harmful for function-approximation methods like PPO.

Q-learning tolerates it because dt=0.1 makes the penalty 10× smaller, and
the tabular value function averages over many revisits to the same state
bucket, smoothing out the non-stationarity.

Set reward_fac[2] = 0.0 in both train_ppo.py and play_ppo.py. The
reward_fac parameter remains on the environment so Q-learning can still
use it; only the PPO scripts override it explicitly.

render() had no handler for render_mode='plot'. show_plot() was only
called from reset() at the start of the next episode, so with a single
episode run the plot never appeared.
- Switch from 2×2 grid to 4×1 vertical layout (figsize 16×12): gives
  each subplot full width and a shared time axis, making crane-pendulum
  interactions easier to align visually.
- Combine primary and twin-axis legend handles so load speed, damping,
  and crane speed (on ax1y2/ax2y2) appear in the legend. ax.legend()
  alone only captured primary-axis lines.
- Use plt.suptitle() instead of plt.title() for the figure-level title.
  plt.title() attached to the last active axes (ax4), placing the text
  between subplots. suptitle() places it above all subplots.
- Add loc="upper left" to ax2 legend to avoid overlap with the twin
  y-axis spine on the right.
- Add suptitle stub to stubs/matplotlib-stubs/pyplot.pyi (pyright did
  not recognise plt.suptitle without it); a sketch follows below.
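
The stub addition might look roughly like this; the real matplotlib signature takes more parameters, but a minimal form is enough for pyright here:

```python
# stubs/matplotlib-stubs/pyplot.pyi (sketch)
from matplotlib.text import Text

def suptitle(t: str, **kwargs: object) -> Text: ...
```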

Also updates CHANGELOG.md with all changes from this branch.
@aleksandarbabicdnv (Collaborator, Author)

Here is a "play" episode plot after training PPO for 2M steps.



@eisDNV (Collaborator) left a comment:

Nice and nice results.

@eisDNV merged commit e55b3c4 into main on Apr 30, 2026
20 checks passed
@eisDNV deleted the fix/ppo-playback-and-visualization branch on April 30, 2026 08:39
