[release/3.0.0-beta2] Remove AutoMate Numba SoftDTW dependency by ooctipus · Pull Request #6056 · isaac-sim/IsaacLab

ooctipus · 2026-06-09T03:19:30Z

Backports #6040 to release/3.0.0-beta2.

Beta2-specific conflict resolution:

- kept beta2's existing `pyproject.toml` packaging shape
- removed `numba` from `source/isaaclab_tasks/setup.py`, where beta2 still declares `install_requires`
- placed the new SoftDTW test under `source/isaaclab_tasks/test` and updated it to load `isaaclab_tasks/direct/automate/soft_dtw_cuda.py`

Validation:

- `git diff --check refs/remotes/upstream/release/3.0.0-beta2...HEAD`
- `python3 -m py_compile source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py source/isaaclab_tasks/test/test_automate_soft_dtw.py source/isaaclab_tasks/setup.py`
- `python3 -m pytest source/isaaclab_tasks/test/test_automate_soft_dtw.py -q` could not run locally because the active system Python does not have torch installed

- Remove `numba` from `isaaclab_tasks` dependencies.
- Replace AutoMate's Numba CUDA/CPU-JIT SoftDTW helper with a Torch implementation that runs on the input tensor device.
- Add a no-grad anti-diagonal SoftDTW path plus `forward_with_lengths(...)` so the AutoMate reward evaluates padded variable-length reference segments in one batched call instead of one SoftDTW call per environment.
- Clean up the autograd SoftDTW path to use a Torch DP table instead of Python row lists, with a clearer docstring.
- Remove the Numba CUDA warning environment variable from `run_w_id.py`.
- Add focused SoftDTW tests for the no-Numba path, hard DTW, normalized SoftDTW, variable-length padded SoftDTW, and finite backward gradients.

Pinning NumPy is not a sustainable way to solve the original failure. AutoMate only needs SoftDTW values for reward computation; it does not require the copied differentiable Numba implementation as a package-level dependency. Keeping Numba also exposes a second failure mode on RTX 5090: the old Numba CUDA kernel can fail at compile time with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION` / unsupported PTX version. This removes the dependency instead of constraining global NumPy resolution.

- Focused tests pass in the develop venv: `python -m pytest source/isaaclab_tasks/test/contrib/test_automate_soft_dtw.py -q` (`5 passed`).
- Focused tests pass in the beta2 venv where Numba import is broken (`5 passed`).
- `git diff --check` passes.
- `py_compile` passes for the touched Python files.
- Old-vs-new SoftDTW CPU forward parity: 594 finite cases across `gamma={0.01,0.1,1.0}`, normalized/non-normalized valid cases, bandwidth `{None,2,20}`, and sequence lengths up to `B=8,N=10,M=100`; max absolute difference was `1.526e-04`.
- Mustafa's row/column Torch DP variant matched the current no-grad implementation exactly in direct forward checks; for `B=128,N=10,M=100`, it measured `82.497 ms` on CUDA versus `14.038 ms` for the anti-diagonal no-grad SoftDTW path, so this PR keeps the anti-diagonal path for reward inference while using the cleaner Torch DP-table style for autograd.
- For `gamma=0`, the old implementation returns `nan` on a simple hard-DTW case; the new implementation returns the expected hard-DTW value `1.0`.
- New SoftDTW autograd smoke test produces finite gradients.
- AutoMate reward parity: optimized length-aware reward path matches the original per-env reward loop on synthetic AutoMate-shaped data (`128` envs, `10` robot waypoints, `ref_len=100`, `gamma=0.01`); max absolute error was `0.0` on CPU and CUDA.

Synthetic AutoMate-shaped reward benchmark on RTX 5090 with Torch `2.10.0+cu128`, `128` envs, `10` robot waypoints, `ref_len=100`, `gamma=0.01`, `no_grad`:

| Path | CPU median | CUDA median | CUDA peak allocated delta |
| --- | ---: | ---: | ---: |
| Per-env Torch reward loop | `141.617 ms` | `355.131 ms` | `15.372 MB` |
| Batched length-aware Torch reward | `13.483 ms` | `25.824 ms` | `15.372 MB` |

The peak CUDA allocation in this reward benchmark is dominated by the closest-state `torch.cdist` calculation, not by the SoftDTW table. The previous Numba CUDA path could not be timed on this RTX 5090 because it fails locally with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`; the performance comparison above is against the direct per-env Torch replacement path that this PR would otherwise have used.

(cherry picked from commit dc51341)
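
For readers unfamiliar with the recurrence behind these numbers, here is a minimal pure-Python sketch of SoftDTW on a single pair of 1-D sequences. It is not the PR's Torch code: the names `softmin` and `soft_dtw` and the squared-difference cost are assumptions made for illustration. It shows why `gamma=0` has to be handled as a plain minimum (the soft minimum divides by `gamma`) and reproduces the hard-DTW value `1.0` for the sequences used in the PR's test.

```python
import math


def softmin(values, gamma):
    """Soft minimum -gamma * log(sum(exp(-v / gamma))); plain min when gamma == 0."""
    if gamma == 0:
        return min(values)
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    # Shift by the smallest value so the exponentials cannot overflow.
    lowest = min(finite)
    total = sum(math.exp(-(v - lowest) / gamma) for v in finite)
    return lowest - gamma * math.log(total)


def soft_dtw(x, y, gamma):
    """SoftDTW between two 1-D sequences (assumed squared-difference cost)."""
    n, m = len(x), len(y)
    # R[i][j] is the accumulated cost of aligning x[:i] with y[:j].
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            previous = (R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            R[i][j] = cost + softmin(previous, gamma)
    return R[n][m]


hard = soft_dtw([0.0, 1.0], [0.0, 2.0], gamma=0)    # hard DTW, equals 1.0
soft = soft_dtw([0.0, 1.0], [0.0, 2.0], gamma=1.0)  # smoothed, never above hard DTW
print(hard, soft)
```

The soft minimum is always at most the hard minimum, so the `gamma > 0` result is a lower bound on the hard-DTW cost and approaches it as `gamma` shrinks.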

isaaclab-review-bot

🤖 IsaacLab Review Bot — PR #6056

Cherry-pick of #6040 onto release/3.0.0-beta2

✅ Summary

This PR cleanly backports the Numba SoftDTW removal from the develop branch (#6040, approved after v4 review) to the beta2 release branch. The changes replace Numba CUDA/CPU-JIT kernels with a pure PyTorch SoftDTW implementation.

Review Checklist

| Area | Status | Notes |
| --- | --- | --- |
| Numba dependency removed | ✅ | Removed from `setup.py` `install_requires` |
| Numba env var cleanup | ✅ | `NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS` removed from `run_w_id.py` |
| PyTorch SoftDTW implementation | ✅ | Clean separation: `_soft_dtw_autograd` (with grad), `_soft_dtw_no_grad` (anti-diagonal batched), `_soft_dtw_variable_y_no_grad` (padded variable-length) |
| Hard DTW support (γ=0) | ✅ | Correctly uses `torch.minimum` instead of logsumexp |
| Bandwidth pruning | ✅ | Sakoe-Chiba band properly applied in all paths |
| Batched reward path | ✅ | `forward_with_lengths` enables batched reward computation; fallback groups by length |
| `automate_algo_utils.py` refactor | ✅ | Per-env loop replaced with batched call; correct beta2-specific conflict resolution |
| Backward/autograd support | ✅ | Autograd path via Torch ops preserved for differentiable use cases |
| Test coverage | ✅ | 5 focused tests: no-numba import, hard DTW value, normalized identity, variable-length padded, finite backward gradients |
| Changelog fragment | ✅ | `fix-automate-numba-constraints.rst` present |
| CI status | ⏳ | Pre-commit & build-wheel pass; installation tests pending |
| API compatibility | ✅ | `SoftDTW.__init__` signature preserved (`use_cuda`, `device` kept for compat) |
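
The bandwidth row can be made concrete with a small sketch that mirrors the band limits in the `_soft_dtw_autograd` diff quoted later in this thread. The helper name `band_columns` is hypothetical; only the three arithmetic lines come from the diff.

```python
def band_columns(i, len_x, len_y, bandwidth):
    """Columns j visited on row i (1-based) under a Sakoe-Chiba band.

    A non-positive bandwidth disables pruning, matching the quoted diff.
    """
    band_size = int(bandwidth) if bandwidth > 0 else max(len_x, len_y)
    j_start = max(1, i - band_size)
    j_end = min(len_y, i + band_size) + 1
    return range(j_start, j_end)


print(list(band_columns(5, 10, 10, bandwidth=2)))  # [3, 4, 5, 6, 7]
print(list(band_columns(5, 10, 10, bandwidth=0)))  # all columns 1..10
```

Cells outside the band are never written, so they keep their initial `inf` value and cannot contribute to any alignment path.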

Observations

- Implementation quality: The anti-diagonal no-grad path is well-optimized for inference (14ms vs 82ms per the commit message). The autograd path uses a simpler row-by-row DP which is correct for gradient computation.
- Variable-length handling: `_soft_dtw_variable_y_no_grad` correctly masks invalid positions with `inf` and indexes the final result using per-sample `y_lengths`. The `forward_with_lengths` autograd fallback correctly loops per-sample (necessary since variable masking breaks autograd).
- Beta2 conflict resolution: PR body documents the beta2-specific adaptations (`setup.py` vs `pyproject.toml` packaging, test placement). The test file is at `source/isaaclab_tasks/test/test_automate_soft_dtw.py` (beta2 test layout) rather than `contrib/` (develop layout).
- Numerical correctness: Commit message reports max absolute difference of 1.526e-04 against the old implementation across 594 test cases, and exact parity (0.0 error) for the AutoMate reward path on synthetic data.
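
The variable-length observation can be illustrated with a short pure-Python sketch. It assumes hard DTW (`gamma=0`) and a squared-difference cost to stay brief, and the names `dtw_table` and `padded_dtw` are hypothetical, not functions from the PR. Padded columns get an `inf` cost, and the result is read at column `y_length` rather than at the last column of the table.

```python
import math


def dtw_table(cost):
    """Accumulated-cost table for hard DTW over one cost matrix (list of rows)."""
    n, m = len(cost), len(cost[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            R[i][j] = cost[i - 1][j - 1] + best
    return R


def padded_dtw(x, y_padded, y_length):
    """DTW of x against the first y_length entries of a padded reference."""
    cost = [
        # Padding columns are masked with inf so no path can pass through them.
        [(xi - yj) ** 2 if j < y_length else math.inf for j, yj in enumerate(y_padded)]
        for xi in x
    ]
    R = dtw_table(cost)
    # Read the answer at the true end of the reference, not at the padded end.
    return R[len(x)][y_length]


x = [0.0, 1.0, 2.0]
y = [0.0, 2.0, 1.0, 3.0]
y_padded = y + [0.0, 0.0]  # padded to a common length of 6

unpadded = dtw_table([[(xi - yj) ** 2 for yj in y] for xi in x])[len(x)][len(y)]
print(padded_dtw(x, y_padded, y_length=4) == unpadded)  # True
```

Because each table cell depends only on cells in the same or earlier columns, the value at column `y_length` is unaffected by whatever sits in the padding.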

Minor Notes (non-blocking)

The _soft_dtw_autograd function uses nested Python loops (for i ... for j ...). This is fine for the reward-only use case (small sequences, rare autograd calls) but would be slow for large sequences requiring gradients. The commit message confirms this tradeoff is intentional.

Verdict

✅ LGTM — Clean cherry-pick with appropriate beta2 conflict resolution. The PyTorch implementation is correct, well-tested, and properly removes the Numba dependency that causes failures on newer GPUs (RTX 5090 PTX version errors).

Update (commit 407112d): New commits add three separate fixes to this beta2 PR:

- Environment destructor fix (`direct_rl_env.py`, `direct_marl_env.py`, `manager_based_env.py`): Prevents `__del__` from emitting tracebacks during Python shutdown. Uses the standard pattern of capturing `sys` as a default arg and checking `sys.meta_path is not None`. Also guards against re-entry when `_is_closed` is already `True`. ✅ Correct.
- AutoMate collision stack (`assembly_env_cfg.py`, `disassembly_env_cfg.py`): Adds `gpu_collision_stack_size=2**27` to avoid dropped contacts at 128 envs. ✅ Reasonable config change.
- Run helper placeholder guard (`run_w_id.py`, `run_disassembly_w_id.py`): Rejects literal `ASSEMBLY_ID` placeholder before launching simulation. ✅ Good UX improvement.
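
The destructor pattern named in the first item can be sketched as follows. `SketchEnv` and its `close_calls` counter are hypothetical stand-ins for illustration, not the Isaac Lab environment classes; only the `sys` default argument, the `sys.meta_path` check, and the `_is_closed` guard come from the description above.

```python
import sys


class SketchEnv:
    """Hypothetical stand-in for an environment class; not Isaac Lab code."""

    def __init__(self):
        self._is_closed = False
        self.close_calls = 0

    def close(self):
        # Guard against re-entry: a second close() is a no-op.
        if self._is_closed:
            return
        self._is_closed = True
        self.close_calls += 1

    def __del__(self, _sys=sys):
        # `_sys` is bound when the method is defined, so it remains reachable
        # even after module globals have been cleared during shutdown.
        if getattr(self, "_is_closed", True):
            return
        # sys.meta_path is None while the interpreter is shutting down;
        # skipping cleanup here avoids tracebacks from half-torn-down modules.
        if _sys.meta_path is None:
            return
        self.close()


env = SketchEnv()
env.close()
env.__del__()  # already closed, so this does nothing
print(env.close_calls)  # 1
```

The `getattr` default of `True` also covers objects whose `__init__` failed before `_is_closed` was set.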

All new changes are clean. Previous inline comments (autograd cell loop performance, pytest tensor comparison) remain open — they were not addressed in these commits but are non-blocking per the original review.

No new issues found.

greptile-apps · 2026-06-09T03:29:18Z

Greptile Summary

This backport removes AutoMate's Numba dependency by replacing the Numba CUDA SoftDTW kernels with a pure-PyTorch implementation, and refactors get_imitation_reward_from_dtw to exploit a new forward_with_lengths API that batches variable-length reference trajectories in one call.

- `soft_dtw_cuda.py`: New `_soft_dtw_no_grad` uses efficient anti-diagonal vectorisation (O(len_x + len_y) tensor ops); `_soft_dtw_autograd` falls back to a per-cell Python loop (O(len_x × len_y) iterations + one `CopySlices` graph node per cell) which may be noticeably slower during gradient-enabled training. `forward_with_lengths` is added for padded batch inputs.
- `automate_algo_utils.py`: Environments are now batched by shared reference-trajectory length (or via `forward_with_lengths`), replacing the previous per-environment serial DTW calls.
- `test_automate_soft_dtw.py`: Adds unit tests for the new implementation; the test suite was not executed (torch not available on the author's machine), and one assertion uses an undocumented pytest × PyTorch comparison pattern.

Confidence Score: 4/5

Safe to merge for dependency removal; the new PyTorch DP is functionally correct for the reward computation path, but the autograd path has an untested performance concern and minor test quality gaps worth tracking.

The reward-computation path (no-grad, inference) is well-vectorised and logically correct. The autograd path uses O(len_x × len_y) Python iterations with per-cell CopySlices nodes — structurally sound but potentially very slow for long sequences under training. The test suite was not executed due to missing torch, leaving the backward-pass behaviour unconfirmed. Two test quality issues (fragile pytest.approx comparison, CUDA tests running on CPU) reduce confidence slightly, but none of these affect correctness of the core Numba removal.

Files that warrant attention: `soft_dtw_cuda.py` (autograd path performance) and `test_automate_soft_dtw.py` (unrun tests, fragile assertion).

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py | Replaces Numba CUDA kernels with pure PyTorch DP; the no-grad path is well-vectorised (anti-diagonal batching), but the autograd path falls back to O(len_x × len_y) per-cell Python loops which will be significantly slower during gradient-enabled training. |
| source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py | Refactors per-environment DTW loop to batch calls via new `forward_with_lengths` API or by grouping environments sharing the same reference trajectory length; logic and shapes look correct. |
| source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py | Removes the `NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS` env-var injection and passes `env=None` (inherited) to subprocess; clean and correct. |
| source/isaaclab_tasks/setup.py | Drops `numba>=0.63.1` from `install_requires`; no other changes. |
| source/isaaclab_tasks/test/test_automate_soft_dtw.py | New test file covering basic SoftDTW behaviour, `forward_with_lengths`, and backward pass; has a fragile `pytest.approx(torch.tensor(...))` comparison, and CUDA-tagged tests run on CPU without a device guard, leaving GPU behaviour uncovered. Tests were not run due to missing torch on the author's system. |
| source/isaaclab_tasks/changelog.d/fix-automate-numba-constraints.rst | Changelog entry for the Numba removal; accurate and complete. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["SoftDTW.forward(X, Y)"] --> B{normalize?}
    B -- yes --> C["Stack X+Y for 3 DTW calls\nthen combine outputs"]
    B -- no --> D["D = dist_func(X, Y)"]
    C --> E["_soft_dtw(D, gamma, bandwidth)"]
    D --> E
    E --> F{grad_enabled AND\nD.requires_grad?}
    F -- yes --> G["_soft_dtw_autograd\n(cell-by-cell loop,\nO(len_x x len_y) iterations)"]
    F -- no --> H["_soft_dtw_no_grad\n(anti-diagonal vectorised,\nO(len_x + len_y) iterations)"]
    G --> I["return R[:, len_x, len_y]"]
    H --> I
    A2["SoftDTW.forward_with_lengths\n(X, Y, y_lengths)"] --> J{grad_enabled AND\nrequires_grad?}
    J -- yes --> K["Per-sample loop\ncalling forward()"]
    J -- no --> L["_soft_dtw_variable_y_no_grad\n(anti-diagonal + length masking)"]
    K --> M["torch.cat outputs"]
    L --> M
    N["get_imitation_reward_from_dtw"] --> O{criterion has\nforward_with_lengths\nAND normalize=False?}
    O -- yes --> P["Pad ref trajs to max_len\ncall forward_with_lengths"]
    O -- no --> Q["Group envs by traj_len\nbatch call forward() per group"]
    P --> R["imitation_rwd = 1 - tanh(soft_dtw)"]
    Q --> R
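
The fallback branch and the final reward step of the flowchart can be sketched in plain Python. This is an illustration only: `group_env_ids_by_length` and `imitation_reward` are hypothetical names, and the real code operates on batched tensors rather than Python lists.

```python
import math
from collections import defaultdict


def group_env_ids_by_length(ref_lengths):
    """Map each reference-trajectory length to the env ids that share it."""
    groups = defaultdict(list)
    for env_id, length in enumerate(ref_lengths):
        groups[length].append(env_id)
    return dict(groups)


def imitation_reward(soft_dtw_value):
    """Reward step from the flowchart: 1 - tanh(soft_dtw)."""
    return 1.0 - math.tanh(soft_dtw_value)


# Five environments with three distinct reference lengths need three
# batched calls instead of five per-environment calls.
groups = group_env_ids_by_length([100, 80, 100, 80, 60])
print(groups)  # {100: [0, 2], 80: [1, 3], 60: [4]}
print(imitation_reward(0.0))  # 1.0 for a perfect match
```

For non-negative SoftDTW values the reward lies in (0, 1] and decreases as the trajectories diverge.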

Comments Outside Diff (2)

`source/isaaclab_tasks/test/test_automate_soft_dtw.py`, line 701-708

`pytest.approx` comparison against a `torch.Tensor` is unreliable

`pytest.approx(torch.tensor([1.0]))` works by iterating the tensor as a Python sequence, so the comparison succeeds if the list representation matches approximately, but this behavior is undocumented and depends on how pytest inspects the argument. If `criterion(x, y)` returns a CUDA tensor or a tensor whose `__eq__` short-circuits before pytest can inspect it, the assertion may pass vacuously or raise. Using `torch.allclose` or extracting a Python scalar (`.item()`) is the idiomatic and reliable pattern here.

`source/isaaclab_tasks/test/test_automate_soft_dtw.py`, line 691-717

CUDA-tagged tests run on CPU, leaving device-specific behaviour untested

`test_soft_dtw_use_cuda_does_not_require_numba` and `test_normalized_soft_dtw_identical_sequences_are_zero` both construct `SoftDTW(use_cuda=True, device="cuda", ...)` but pass plain CPU tensors. Because `use_cuda` and `device` are now ignored, the tests happen to pass without a GPU, but they provide zero coverage of a) mixed-device errors, b) numeric fidelity on CUDA, and c) any future path that reintroduces device routing. Adding a `pytest.importorskip`/`skipif` guard conditioned on `torch.cuda.is_available()` and moving the tensors to `.cuda()` inside those tests would make the intent explicit.

Reviews (1): Last reviewed commit: "Remove AutoMate Numba SoftDTW dependency..."

greptile-apps · 2026-06-09T03:29:21Z

+def _soft_dtw_autograd(D: torch.Tensor, gamma: float, bandwidth: float) -> torch.Tensor:
+    """Compute SoftDTW using Torch ops that preserve autograd."""
+    batch_size, len_x, len_y = D.shape
+    R = torch.full((batch_size, len_x + 2, len_y + 2), float("inf"), device=D.device, dtype=D.dtype)
+    R[:, 0, 0] = 0

+    band_size = int(bandwidth) if bandwidth > 0 else max(len_x, len_y)
+    for i in range(1, len_x + 1):
+        j_start = max(1, i - band_size)
+        j_end = min(len_y, i + band_size) + 1

-# ----------------------------------------------------------------------------------------------------------------------
-@cuda.jit
-def compute_softdtw_backward_cuda(D, R, inv_gamma, bandwidth, max_i, max_j, n_passes, E):
-    k = cuda.blockIdx.x
-    tid = cuda.threadIdx.x
+        for j in range(j_start, j_end):
+            r0 = R[:, i - 1, j - 1]
+            r1 = R[:, i - 1, j]
+            r2 = R[:, i, j - 1]

-    # Indexing logic is the same as above, however, the anti-diagonal needs to
-    # progress backwards
+            if gamma == 0:
+                softmin = torch.minimum(torch.minimum(r0, r1), r2)
+            else:
+                previous_costs = torch.stack((r0, r1, r2))
+                softmin = -gamma * torch.logsumexp(-previous_costs / gamma, dim=0)

-    for p in range(n_passes):
-        # Reverse the order to make the loop go backward
-        rev_p = n_passes - p - 1
+            R[:, i, j] = D[:, i - 1, j - 1] + softmin

-        # convert tid to I, J, then i, j
-        J = max(0, min(rev_p - tid, max_j - 1))
+    return R[:, len_x, len_y]



_soft_dtw_autograd cell-by-cell loop vs. anti-diagonal vectorisation

_soft_dtw_no_grad processes entire anti-diagonals in one batched tensor op (O(len_x + len_y) iterations, each operating on a full diagonal slice). _soft_dtw_autograd loops over every individual (i, j) cell (O(len_x × len_y) Python iterations) and also creates a CopySlices autograd node per cell. For typical AutoMate trajectory lengths (e.g. 256 steps), that is ~65 k Python loop iterations and ~65 k graph nodes per forward pass, vs. ~511 iterations in the no-grad path. During training with gradient tracking this will be several orders of magnitude slower than the previous CUDA implementation, potentially dominating step time.
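
The iteration counts quoted above can be checked with a small pure-Python sketch that fills the same DP table in both orders. It assumes hard DTW for brevity, and `dtw_row_major` and `dtw_anti_diagonal` are hypothetical names. The inner loop of the anti-diagonal version is the part a tensor implementation would replace with one batched op per diagonal.

```python
import math


def dtw_row_major(cost):
    """Fill the table cell by cell; returns (result, number of cell updates)."""
    n, m = len(cost), len(cost[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    cells = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = cost[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            cells += 1
    return R[n][m], cells


def dtw_anti_diagonal(cost):
    """Fill the table one anti-diagonal at a time; returns (result, sweeps)."""
    n, m = len(cost), len(cost[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    sweeps = 0
    for k in range(2, n + m + 1):  # k = i + j is constant along a diagonal
        # Cells on diagonal k read only diagonals k - 1 and k - 2, so they are
        # independent of each other and could be updated as one tensor slice.
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            R[i][j] = cost[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
        sweeps += 1
    return R[n][m], sweeps


size = 256
cost = [[float((7 * i + 3 * j) % 5) for j in range(size)] for i in range(size)]
row_value, cells = dtw_row_major(cost)
diag_value, sweeps = dtw_anti_diagonal(cost)
print(row_value == diag_value, cells, sweeps)  # True 65536 511
```

Both orders visit every cell after its three predecessors, so the results agree; only the number of sequential steps differs.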

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-09T03:29:24Z

+    x = torch.tensor([[[0.0], [1.0]]])
+    y = torch.tensor([[[0.0], [2.0]]])
+
+    assert criterion(x, y) == pytest.approx(torch.tensor([1.0]))


The comparison relies on undocumented pytest × PyTorch tensor interaction. Extracting a scalar via .item() and comparing against a Python float is explicit and version-proof.

Suggested change

-    assert criterion(x, y) == pytest.approx(torch.tensor([1.0]))
+    assert criterion(x, y).item() == pytest.approx(1.0)



ooctipus requested a review from kellyguo11 as a code owner June 9, 2026 03:19

github-actions Bot added bug Something isn't working isaac-lab Related to Isaac Lab team labels Jun 9, 2026

isaaclab-review-bot Bot reviewed Jun 9, 2026


greptile-apps Bot reviewed Jun 9, 2026


Merge branch 'release/3.0.0-beta2' into fix/beta2-automate-softdtw-no…

407112d

…-numba

AntoineRichard approved these changes Jun 9, 2026


AntoineRichard merged commit fc6e8c8 into isaac-sim:release/3.0.0-beta2 Jun 9, 2026
37 checks passed

ooctipus deleted the fix/beta2-automate-softdtw-no-numba branch June 9, 2026 13:05

AntoineRichard merged 2 commits into isaac-sim:release/3.0.0-beta2 from ooctipus:fix/beta2-automate-softdtw-no-numba
