[release/3.0.0-beta2] Remove AutoMate Numba SoftDTW dependency #6056
- Remove `numba` from `isaaclab_tasks` dependencies.
- Replace AutoMate's Numba CUDA/CPU-JIT SoftDTW helper with a Torch
implementation that runs on the input tensor device.
- Add a no-grad anti-diagonal SoftDTW path plus
`forward_with_lengths(...)` so the AutoMate reward evaluates padded
variable-length reference segments in one batched call instead of one
SoftDTW call per environment.
- Clean up the autograd SoftDTW path to use a Torch DP table instead of
Python row lists, with a clearer docstring.
- Remove the Numba CUDA warning environment variable from `run_w_id.py`.
- Add focused SoftDTW tests for the no-Numba path, hard DTW, normalized
SoftDTW, variable-length padded SoftDTW, and finite backward gradients.
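
The anti-diagonal idea in the list above can be illustrated without Torch. The sketch below is plain Python, not the PR's implementation, and `soft_min`, `soft_dtw_rows`, and `soft_dtw_antidiagonal` are hypothetical names. It assumes `gamma > 0` and no bandwidth. Every cell on the anti-diagonal `i + j == k` reads only diagonals `k - 1` and `k - 2`, so a whole diagonal can be updated in one step, which is what a tensor library can batch.

```python
import math

INF = float("inf")

def soft_min(values, gamma):
    """Smoothed minimum of the finite entries (gamma > 0 assumed)."""
    finite = [v for v in values if v != INF]
    lo = min(finite)  # subtract the minimum for numerical stability
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in finite))

def soft_dtw_rows(D, gamma):
    """Reference DP: visit cells row by row."""
    n, m = len(D), len(D[0])
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            R[i][j] = D[i - 1][j - 1] + soft_min(prev, gamma)
    return R[n][m]

def soft_dtw_antidiagonal(D, gamma):
    """Same DP, visiting cells one anti-diagonal (i + j == k) at a time."""
    n, m = len(D), len(D[0])
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for k in range(2, n + m + 1):
        # Cells on this diagonal only read diagonals k - 1 and k - 2,
        # so they do not depend on each other.
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            prev = (R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            R[i][j] = D[i - 1][j - 1] + soft_min(prev, gamma)
    return R[n][m]

x = [0.0, 1.0, 3.0]
y = [0.0, 2.0, 2.5, 4.0]
D = [[(a - b) ** 2 for b in y] for a in x]  # assumed squared Euclidean cost
row_value = soft_dtw_rows(D, gamma=0.1)
diag_value = soft_dtw_antidiagonal(D, gamma=0.1)
```

Both traversals fill the same table, so the two values agree. The outer loop of the anti-diagonal version runs `n + m - 1` times instead of `n * m`.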
Pinning NumPy is not a sustainable fix for the original failure.
AutoMate only needs SoftDTW values for reward computation; it does
not require the copied differentiable Numba implementation as a
package-level dependency. Keeping Numba also exposes a second failure
mode on RTX 5090: the old Numba CUDA kernel can fail at compile time
with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION` / unsupported PTX version.
This removes the dependency instead of constraining global NumPy
resolution.
- Focused tests pass in the develop venv: `python -m pytest
source/isaaclab_tasks/test/contrib/test_automate_soft_dtw.py -q` (`5
passed`).
- Focused tests pass in the beta2 venv where Numba import is broken (`5
passed`).
- `git diff --check` passes.
- `py_compile` passes for the touched Python files.
- Old-vs-new SoftDTW CPU forward parity: 594 finite cases across
`gamma={0.01,0.1,1.0}`, normalized/non-normalized valid cases, bandwidth
`{None,2,20}`, and sequence lengths up to `B=8,N=10,M=100`; max absolute
difference was `1.526e-04`.
- Mustafa's row/column Torch DP variant matched the current no-grad
implementation exactly in direct forward checks; for `B=128,N=10,M=100`,
it measured `82.497 ms` on CUDA versus `14.038 ms` for the anti-diagonal
no-grad SoftDTW path, so this PR keeps the anti-diagonal path for reward
inference while using the cleaner Torch DP-table style for autograd.
- For `gamma=0`, the old implementation returns `nan` on a simple
hard-DTW case; the new implementation returns the expected hard-DTW
value `1.0`.
- New SoftDTW autograd smoke test produces finite gradients.
- AutoMate reward parity: optimized length-aware reward path matches the
original per-env reward loop on synthetic AutoMate-shaped data (`128`
envs, `10` robot waypoints, `ref_len=100`, `gamma=0.01`); max absolute
error was `0.0` on CPU and CUDA.
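
The `gamma=0` case in the list above can be checked by hand. The soft-min expression `-gamma * log(sum(exp(-r / gamma)))` divides by `gamma`, so the hard-DTW limit needs its own branch that takes a plain minimum. A minimal plain-Python sketch, not the PR's Torch code: `hard_dtw` is a hypothetical helper, and the squared Euclidean cost and the two input sequences are assumptions chosen for illustration.

```python
INF = float("inf")

def hard_dtw(x, y):
    """Classic DTW (the gamma == 0 limit of SoftDTW) on 1-D sequences."""
    n, m = len(x), len(y)
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # assumed squared Euclidean cost
            # Hard minimum instead of the gamma-scaled log-sum-exp.
            R[i][j] = cost + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R[n][m]

value = hard_dtw([0.0, 1.0], [0.0, 2.0])  # evaluates to 1.0
```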
Synthetic AutoMate-shaped reward benchmark on RTX 5090 with Torch
`2.10.0+cu128`, `128` envs, `10` robot waypoints, `ref_len=100`,
`gamma=0.01`, `no_grad`:
| Path | CPU median | CUDA median | CUDA peak allocated delta |
| --- | ---: | ---: | ---: |
| Per-env Torch reward loop | `141.617 ms` | `355.131 ms` | `15.372 MB` |
| Batched length-aware Torch reward | `13.483 ms` | `25.824 ms` | `15.372 MB` |
The peak CUDA allocation in this reward benchmark is dominated by the
closest-state `torch.cdist` calculation, not by the SoftDTW table.
The previous Numba CUDA path could not be timed on this RTX 5090 because
it fails locally with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`; the
performance comparison above is against the direct per-env Torch
replacement path that this PR would otherwise have used.
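
For readers who want to reproduce medians like the ones above, a minimal timing sketch follows. It is an assumption about methodology, not the script used for this PR, and `median_ms` is a hypothetical helper. For CUDA timings the measured callable would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before the clock is read.

```python
import statistics
import time

def median_ms(fn, warmup=3, repeats=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation before measuring
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workload; a real run would call the reward function instead.
elapsed = median_ms(lambda: sum(i * i for i in range(10_000)))
```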
(cherry picked from commit dc51341)
🤖 IsaacLab Review Bot — PR #6056
Cherry-pick of #6040 onto release/3.0.0-beta2
✅ Summary
This PR cleanly backports the Numba SoftDTW removal from the develop branch (#6040, approved after v4 review) to the beta2 release branch. The changes replace Numba CUDA/CPU-JIT kernels with a pure PyTorch SoftDTW implementation.
Review Checklist
| Area | Status | Notes |
|---|---|---|
| Numba dependency removed | ✅ | Removed from setup.py install_requires |
| Numba env var cleanup | ✅ | NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS removed from run_w_id.py |
| PyTorch SoftDTW implementation | ✅ | Clean separation: _soft_dtw_autograd (with grad), _soft_dtw_no_grad (anti-diagonal batched), _soft_dtw_variable_y_no_grad (padded variable-length) |
| Hard DTW support (γ=0) | ✅ | Correctly uses torch.minimum instead of logsumexp |
| Bandwidth pruning | ✅ | Sakoe-Chiba band properly applied in all paths |
| Batched reward path | ✅ | forward_with_lengths enables batched reward computation; fallback groups by length |
| automate_algo_utils.py refactor | ✅ | Per-env loop replaced with batched call; correct beta2-specific conflict resolution |
| Backward/autograd support | ✅ | Autograd path via Torch ops preserved for differentiable use cases |
| Test coverage | ✅ | 5 focused tests: no-numba import, hard DTW value, normalized identity, variable-length padded, finite backward gradients |
| Changelog fragment | ✅ | fix-automate-numba-constraints.rst present |
| CI status | ⏳ | Pre-commit & build-wheel pass; installation tests pending |
| API compatibility | ✅ | SoftDTW.__init__ signature preserved (use_cuda, device kept for compat) |
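
The bandwidth row in the checklist refers to a Sakoe-Chiba band: row `i` only visits columns with `|i - j| <= bandwidth`. The sketch below is plain Python with a hypothetical helper name, `band_columns`; it mirrors the `j_start` / `j_end` arithmetic visible in the `_soft_dtw_autograd` hunk of this PR, where a non-positive bandwidth means no pruning.

```python
def band_columns(i, len_x, len_y, bandwidth):
    """1-based columns j that row i may visit under a Sakoe-Chiba band."""
    band = int(bandwidth) if bandwidth > 0 else max(len_x, len_y)
    j_start = max(1, i - band)
    j_end = min(len_y, i + band)
    return list(range(j_start, j_end + 1))

full_row = band_columns(3, len_x=5, len_y=6, bandwidth=0)    # no pruning
banded_row = band_columns(3, len_x=5, len_y=6, bandwidth=1)  # |i - j| <= 1
```

With these inputs `full_row` covers columns 1 through 6 and `banded_row` covers columns 2 through 4.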
Observations
- Implementation quality: The anti-diagonal no-grad path is well-optimized for inference (14ms vs 82ms per the commit message). The autograd path uses a simpler row-by-row DP which is correct for gradient computation.
- Variable-length handling: `_soft_dtw_variable_y_no_grad` correctly masks invalid positions with `inf` and indexes the final result using per-sample `y_lengths`. The `forward_with_lengths` autograd fallback correctly loops per-sample (necessary since variable masking breaks autograd).
- Beta2 conflict resolution: PR body documents the beta2-specific adaptations (setup.py vs pyproject.toml packaging, test placement). The test file is at `source/isaaclab_tasks/test/test_automate_soft_dtw.py` (beta2 test layout) rather than `contrib/` (develop layout).
- Numerical correctness: Commit message reports max absolute difference of 1.526e-04 against the old implementation across 594 test cases, and exact parity (0.0 error) for the AutoMate reward path on synthetic data.
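
The variable-length handling above relies on a property of the DP table: `R[i][j]` only depends on cells with smaller indices, so the value for a reference of true length `L` can be read at column `L` of a table built on the padded sequence. A plain-Python sketch under that assumption (hypothetical names, no bandwidth, `gamma > 0`, squared Euclidean cost; the PR's Torch path additionally masks padded positions with `inf`). The final line applies the `1 - tanh(soft_dtw)` reward mapping shown in the flowchart.

```python
import math

INF = float("inf")

def soft_dtw_table(x, y, gamma):
    """Full SoftDTW table R for 1-D sequences (gamma > 0 assumed)."""
    n, m = len(x), len(y)
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [v for v in (R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]) if v != INF]
            lo = min(prev)  # shift for numerical stability
            soft = lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in prev))
            R[i][j] = (x[i - 1] - y[j - 1]) ** 2 + soft
    return R

xs = [[0.0, 1.0], [0.5, 2.0]]
refs = [[0.0, 2.0, 3.0], [1.0]]  # variable-length references
lengths = [len(r) for r in refs]
max_len = max(lengths)
padded = [r + [0.0] * (max_len - len(r)) for r in refs]  # pad value is arbitrary

# Build each table on the padded reference, then read it at the true length.
batched = [soft_dtw_table(x, y, 0.1)[len(x)][n] for x, y, n in zip(xs, padded, lengths)]
per_sample = [soft_dtw_table(x, r, 0.1)[len(x)][len(r)] for x, r in zip(xs, refs)]
rewards = [1.0 - math.tanh(v) for v in batched]
```

Padding only adds columns to the right of the true length, so the cells that are read back are identical to those of the unpadded tables.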
Minor Notes (non-blocking)
- The `_soft_dtw_autograd` function uses nested Python loops (`for i ... for j ...`). This is fine for the reward-only use case (small sequences, rare autograd calls) but would be slow for large sequences requiring gradients. The commit message confirms this tradeoff is intentional.
Verdict
✅ LGTM — Clean cherry-pick with appropriate beta2 conflict resolution. The PyTorch implementation is correct, well-tested, and properly removes the Numba dependency that causes failures on newer GPUs (RTX 5090 PTX version errors).
Update (commit 407112d): New commits add three separate fixes to this beta2 PR:
- Environment destructor fix (`direct_rl_env.py`, `direct_marl_env.py`, `manager_based_env.py`): Prevents `__del__` from emitting tracebacks during Python shutdown. Uses the standard pattern of capturing `sys` as a default arg and checking `sys.meta_path is not None`. Also guards against re-entry when `_is_closed` is already True. ✅ Correct.
- AutoMate collision stack (`assembly_env_cfg.py`, `disassembly_env_cfg.py`): Adds `gpu_collision_stack_size=2**27` to avoid dropped contacts at 128 envs. ✅ Reasonable config change.
- Run helper placeholder guard (`run_w_id.py`, `run_disassembly_w_id.py`): Rejects literal `ASSEMBLY_ID` placeholder before launching simulation. ✅ Good UX improvement.
All new changes are clean. Previous inline comments (autograd cell loop performance, pytest tensor comparison) remain open — they were not addressed in these commits but are non-blocking per the original review.
No new issues found.
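
The destructor pattern described in this update can be sketched as follows. `DemoEnv` is a hypothetical stand-in, not one of the IsaacLab environment classes. It binds `sys` as a default argument so the module stays reachable during interpreter shutdown, skips cleanup when `sys.meta_path` is `None`, and returns early when the environment is already closed.

```python
import sys

class DemoEnv:
    """Hypothetical environment illustrating a shutdown-safe __del__."""

    def __init__(self):
        self._is_closed = False
        self.close_calls = 0

    def close(self):
        if self._is_closed:
            return  # guard against re-entry
        self._is_closed = True
        self.close_calls += 1

    def __del__(self, sys=sys):
        # sys.meta_path becomes None late in interpreter shutdown;
        # cleanup at that point can fail on modules already torn down.
        if sys.meta_path is None or self._is_closed:
            return
        self.close()

env = DemoEnv()
env.close()
env.__del__()  # explicit call for demonstration; no second close happens
calls_after_double_cleanup = env.close_calls
```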
Greptile Summary
This backport removes AutoMate's Numba dependency by replacing the Numba CUDA SoftDTW kernels with a pure-PyTorch implementation, and refactors

Confidence Score: 4/5
Safe to merge for dependency removal; the new PyTorch DP is functionally correct for the reward computation path, but the autograd path has an untested performance concern and minor test quality gaps worth tracking.

The reward-computation path (no-grad, inference) is well-vectorised and logically correct. The autograd path uses O(len_x × len_y) Python iterations with per-cell CopySlices nodes; it is structurally sound but potentially very slow for long sequences under training. The test suite was not executed due to missing torch, leaving the backward-pass behaviour unconfirmed. Two test quality issues (fragile pytest.approx comparison, CUDA tests running on CPU) reduce confidence slightly, but none of these affect correctness of the core Numba removal.

soft_dtw_cuda.py (autograd path performance) and test_automate_soft_dtw.py (unrun tests, fragile assertion).

Important Files Changed

Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["SoftDTW.forward(X, Y)"] --> B{normalize?}
B -- yes --> C["Stack X+Y for 3 DTW calls\nthen combine outputs"]
B -- no --> D["D = dist_func(X, Y)"]
C --> E["_soft_dtw(D, gamma, bandwidth)"]
D --> E
E --> F{grad_enabled AND\nD.requires_grad?}
F -- yes --> G["_soft_dtw_autograd\n(cell-by-cell loop,\nO(len_x x len_y) iterations)"]
F -- no --> H["_soft_dtw_no_grad\n(anti-diagonal vectorised,\nO(len_x + len_y) iterations)"]
G --> I["return R[:, len_x, len_y]"]
H --> I
A2["SoftDTW.forward_with_lengths\n(X, Y, y_lengths)"] --> J{grad_enabled AND\nrequires_grad?}
J -- yes --> K["Per-sample loop\ncalling forward()"]
J -- no --> L["_soft_dtw_variable_y_no_grad\n(anti-diagonal + length masking)"]
K --> M["torch.cat outputs"]
L --> M
N["get_imitation_reward_from_dtw"] --> O{criterion has\nforward_with_lengths\nAND normalize=False?}
O -- yes --> P["Pad ref trajs to max_len\ncall forward_with_lengths"]
O -- no --> Q["Group envs by traj_len\nbatch call forward() per group"]
P --> R["imitation_rwd = 1 - tanh(soft_dtw)"]
Q --> R

    +def _soft_dtw_autograd(D: torch.Tensor, gamma: float, bandwidth: float) -> torch.Tensor:
    +    """Compute SoftDTW using Torch ops that preserve autograd."""
    +    batch_size, len_x, len_y = D.shape
    +    R = torch.full((batch_size, len_x + 2, len_y + 2), float("inf"), device=D.device, dtype=D.dtype)
    +    R[:, 0, 0] = 0
    +    band_size = int(bandwidth) if bandwidth > 0 else max(len_x, len_y)
    +    for i in range(1, len_x + 1):
    +        j_start = max(1, i - band_size)
    +        j_end = min(len_y, i + band_size) + 1
    -# ----------------------------------------------------------------------------------------------------------------------
    -@cuda.jit
    -def compute_softdtw_backward_cuda(D, R, inv_gamma, bandwidth, max_i, max_j, n_passes, E):
    -    k = cuda.blockIdx.x
    -    tid = cuda.threadIdx.x
    +        for j in range(j_start, j_end):
    +            r0 = R[:, i - 1, j - 1]
    +            r1 = R[:, i - 1, j]
    +            r2 = R[:, i, j - 1]
    -    # Indexing logic is the same as above, however, the anti-diagonal needs to
    -    # progress backwards
    +            if gamma == 0:
    +                softmin = torch.minimum(torch.minimum(r0, r1), r2)
    +            else:
    +                previous_costs = torch.stack((r0, r1, r2))
    +                softmin = -gamma * torch.logsumexp(-previous_costs / gamma, dim=0)
    -    for p in range(n_passes):
    -        # Reverse the order to make the loop go backward
    -        rev_p = n_passes - p - 1
    +            R[:, i, j] = D[:, i - 1, j - 1] + softmin
    -        # convert tid to I, J, then i, j
    -        J = max(0, min(rev_p - tid, max_j - 1))
    +    return R[:, len_x, len_y]

_soft_dtw_autograd cell-by-cell loop vs. anti-diagonal vectorisation
_soft_dtw_no_grad processes entire anti-diagonals in one batched tensor op (O(len_x + len_y) iterations, each operating on a full diagonal slice). _soft_dtw_autograd loops over every individual (i, j) cell (O(len_x × len_y) Python iterations) and also creates a CopySlices autograd node per cell. For typical AutoMate trajectory lengths (e.g. 256 steps), that is ~65 k Python loop iterations and ~65 k graph nodes per forward pass, vs. ~511 iterations in the no-grad path. During training with gradient tracking this will be several orders of magnitude slower than the previous CUDA implementation, potentially dominating step time.
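
The iteration counts quoted in this comment follow directly from the table shape. A quick check, assuming the square `256 × 256` table used as the comment's example (`iteration_counts` is a hypothetical helper):

```python
def iteration_counts(len_x, len_y):
    """Python-level loop iterations for the two traversal orders."""
    per_cell = len_x * len_y          # one iteration per (i, j) cell
    per_diagonal = len_x + len_y - 1  # one iteration per anti-diagonal
    return per_cell, per_diagonal

cells, diagonals = iteration_counts(256, 256)
# cells == 65536, diagonals == 511
```

For the reward shapes benchmarked in this PR (`N=10`, `M=100`) the same formula gives 1000 cells versus 109 anti-diagonals.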

    x = torch.tensor([[[0.0], [1.0]]])
    y = torch.tensor([[[0.0], [2.0]]])

    assert criterion(x, y) == pytest.approx(torch.tensor([1.0]))

The comparison relies on undocumented pytest × PyTorch tensor interaction. Extracting a scalar via `.item()` and comparing against a Python float is explicit and version-proof.

    -assert criterion(x, y) == pytest.approx(torch.tensor([1.0]))
    +assert criterion(x, y).item() == pytest.approx(1.0)

Merged commit fc6e8c8 into isaac-sim:release/3.0.0-beta2.
Backports #6040 to release/3.0.0-beta2.
Beta2-specific conflict resolution:
Validation: