
[release/3.0.0-beta2] Remove AutoMate Numba SoftDTW dependency #6056

Merged
AntoineRichard merged 2 commits into isaac-sim:release/3.0.0-beta2 from ooctipus:fix/beta2-automate-softdtw-no-numba
Jun 9, 2026

Conversation


@ooctipus ooctipus commented Jun 9, 2026

Collaborator

Backports #6040 to release/3.0.0-beta2.

Beta2-specific conflict resolution:

  • kept beta2's existing pyproject.toml packaging shape
  • removed numba from source/isaaclab_tasks/setup.py, where beta2 still declares install_requires
  • placed the new SoftDTW test under source/isaaclab_tasks/test and updated it to load isaaclab_tasks/direct/automate/soft_dtw_cuda.py

Validation:

  • git diff --check refs/remotes/upstream/release/3.0.0-beta2...HEAD
  • python3 -m py_compile source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py source/isaaclab_tasks/test/test_automate_soft_dtw.py source/isaaclab_tasks/setup.py
  • python3 -m pytest source/isaaclab_tasks/test/test_automate_soft_dtw.py -q could not run locally because the active system Python does not have torch installed

- Remove `numba` from `isaaclab_tasks` dependencies.
- Replace AutoMate's Numba CUDA/CPU-JIT SoftDTW helper with a Torch
implementation that runs on the input tensor device.
- Add a no-grad anti-diagonal SoftDTW path plus
`forward_with_lengths(...)` so the AutoMate reward evaluates padded
variable-length reference segments in one batched call instead of one
SoftDTW call per environment.
- Clean up the autograd SoftDTW path to use a Torch DP table instead of
Python row lists, with a clearer docstring.
- Remove the Numba CUDA warning environment variable from `run_w_id.py`.
- Add focused SoftDTW tests for the no-Numba path, hard DTW, normalized
SoftDTW, variable-length padded SoftDTW, and finite backward gradients.
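To make the hard-DTW and SoftDTW items above concrete, here is a minimal pure-Python sketch of the recurrence. It is not the code in `soft_dtw_cuda.py`: the function names and the squared-difference cost are assumptions chosen for illustration. The `gamma == 0` case gets its own branch because the soft-min formula divides by `gamma`.

```python
import math


def soft_min(values, gamma):
    """Soft minimum of the values; plain minimum when gamma == 0."""
    m = min(values)
    if gamma == 0 or m == math.inf:
        return m
    # Shift by the smallest value so the exponentials cannot overflow.
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma):
    """Row-by-row SoftDTW between two 1-D sequences (illustrative only)."""
    n, m = len(x), len(y)
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # assumed squared-difference cost
            R[i][j] = cost + soft_min((R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]), gamma)
    return R[n][m]


hard = soft_dtw([0.0, 1.0], [0.0, 2.0], gamma=0)     # hard DTW
soft = soft_dtw([0.0, 1.0], [0.0, 2.0], gamma=0.01)  # soft value, never above the hard one
```

For these two short sequences the hard-DTW value is `1.0`, which is the value the hard-DTW validation note in this description expects.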

The original failure cannot be fixed sustainably with a NumPy
pin. AutoMate only needs SoftDTW values for reward computation; it does
not require the copied differentiable Numba implementation as a
package-level dependency. Keeping Numba also exposes a second failure
mode on RTX 5090: the old Numba CUDA kernel can fail at compile time
with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION` / unsupported PTX version.

This removes the dependency instead of constraining global NumPy
resolution.

- Focused tests pass in the develop venv: `python -m pytest
source/isaaclab_tasks/test/contrib/test_automate_soft_dtw.py -q` (`5
passed`).
- Focused tests pass in the beta2 venv where Numba import is broken (`5
passed`).
- `git diff --check` passes.
- `py_compile` passes for the touched Python files.
- Old-vs-new SoftDTW CPU forward parity: 594 finite cases across
`gamma={0.01,0.1,1.0}`, normalized/non-normalized valid cases, bandwidth
`{None,2,20}`, and sequence lengths up to `B=8,N=10,M=100`; max absolute
difference was `1.526e-04`.
- Mustafa's row/column Torch DP variant matched the current no-grad
implementation exactly in direct forward checks; for `B=128,N=10,M=100`,
it measured `82.497 ms` on CUDA versus `14.038 ms` for the anti-diagonal
no-grad SoftDTW path, so this PR keeps the anti-diagonal path for reward
inference while using the cleaner Torch DP-table style for autograd.
- For `gamma=0`, the old implementation returns `nan` on a simple
hard-DTW case; the new implementation returns the expected hard-DTW
value `1.0`.
- New SoftDTW autograd smoke test produces finite gradients.
- AutoMate reward parity: optimized length-aware reward path matches the
original per-env reward loop on synthetic AutoMate-shaped data (`128`
envs, `10` robot waypoints, `ref_len=100`, `gamma=0.01`); max absolute
error was `0.0` on CPU and CUDA.
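The anti-diagonal path mentioned in the validation notes can be illustrated without Torch. The sketch below is an assumption-laden toy (plain Python lists, hypothetical function names, no bandwidth), not the PR's implementation. It relies on one property: every cell with `i + j = k` reads only cells from diagonals `k - 1` and `k - 2`, so a whole diagonal can be updated in one step. In a tensor implementation the inner loop would become one batched op.

```python
import math


def soft_min3(a, b, c, gamma):
    """Soft minimum of three values; plain minimum when gamma == 0."""
    m = min(a, b, c)
    if gamma == 0 or m == math.inf:
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in (a, b, c)))


def dtw_row_major(D, gamma):
    """Reference DP that visits cells row by row."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + soft_min3(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
    return R[n][m]


def dtw_anti_diagonal(D, gamma):
    """Same DP, visiting one anti-diagonal (i + j == k) per outer step."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    steps = 0
    for k in range(2, n + m + 1):
        # Cells on this diagonal are independent of each other.
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            R[i][j] = D[i - 1][j - 1] + soft_min3(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
        steps += 1
    return R[n][m], steps


# Deterministic toy cost matrix with the benchmark's shape (N=10, M=100).
D = [[((7 * i + 3 * j) % 11) / 10.0 for j in range(100)] for i in range(10)]
row_value = dtw_row_major(D, gamma=0.1)
diag_value, diag_steps = dtw_anti_diagonal(D, gamma=0.1)
```

Both traversals perform the same per-cell arithmetic, so the results agree; only the number of sequential outer steps changes (`N + M - 1` diagonals instead of `N * M` cells).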

Synthetic AutoMate-shaped reward benchmark on RTX 5090 with Torch
`2.10.0+cu128`, `128` envs, `10` robot waypoints, `ref_len=100`,
`gamma=0.01`, `no_grad`:

| Path | CPU median | CUDA median | CUDA peak allocated delta |
| --- | ---: | ---: | ---: |
| Per-env Torch reward loop | `141.617 ms` | `355.131 ms` | `15.372 MB` |
| Batched length-aware Torch reward | `13.483 ms` | `25.824 ms` | `15.372 MB` |

The peak CUDA allocation in this reward benchmark is dominated by the
closest-state `torch.cdist` calculation, not by the SoftDTW table.

The previous Numba CUDA path could not be timed on this RTX 5090 because
it fails locally with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`; the
performance comparison above is against the direct per-env Torch
replacement path that this PR would otherwise have used.

(cherry picked from commit dc51341)
@ooctipus ooctipus requested a review from kellyguo11 as a code owner June 9, 2026 03:19
@github-actions github-actions Bot added the bug and isaac-lab labels Jun 9, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment


🤖 IsaacLab Review Bot — PR #6056

Cherry-pick of #6040 onto release/3.0.0-beta2

✅ Summary

This PR cleanly backports the Numba SoftDTW removal from the develop branch (#6040, approved after v4 review) to the beta2 release branch. The changes replace Numba CUDA/CPU-JIT kernels with a pure PyTorch SoftDTW implementation.

Review Checklist

| Area | Status | Notes |
| --- | --- | --- |
| Numba dependency removed | | Removed from setup.py install_requires |
| Numba env var cleanup | | NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS removed from run_w_id.py |
| PyTorch SoftDTW implementation | | Clean separation: _soft_dtw_autograd (with grad), _soft_dtw_no_grad (anti-diagonal batched), _soft_dtw_variable_y_no_grad (padded variable-length) |
| Hard DTW support (γ=0) | | Correctly uses torch.minimum instead of logsumexp |
| Bandwidth pruning | | Sakoe-Chiba band properly applied in all paths |
| Batched reward path | | forward_with_lengths enables batched reward computation; fallback groups by length |
| automate_algo_utils.py refactor | | Per-env loop replaced with batched call; correct beta2-specific conflict resolution |
| Backward/autograd support | | Autograd path via Torch ops preserved for differentiable use cases |
| Test coverage | | 5 focused tests: no-numba import, hard DTW value, normalized identity, variable-length padded, finite backward gradients |
| Changelog fragment | | fix-automate-numba-constraints.rst present |
| CI status | | Pre-commit & build-wheel pass; installation tests pending |
| API compatibility | | SoftDTW.__init__ signature preserved (use_cuda, device kept for compat) |

Observations

  1. Implementation quality: The anti-diagonal no-grad path is well-optimized for inference (14ms vs 82ms per the commit message). The autograd path uses a simpler row-by-row DP which is correct for gradient computation.

  2. Variable-length handling: _soft_dtw_variable_y_no_grad correctly masks invalid positions with inf and indexes the final result using per-sample y_lengths. The forward_with_lengths autograd fallback correctly loops per-sample (necessary since variable masking breaks autograd).

  3. Beta2 conflict resolution: PR body documents the beta2-specific adaptations (setup.py vs pyproject.toml packaging, test placement). The test file is at source/isaaclab_tasks/test/test_automate_soft_dtw.py (beta2 test layout) rather than contrib/ (develop layout).

  4. Numerical correctness: Commit message reports max absolute difference of 1.526e-04 against the old implementation across 594 test cases, and exact parity (0.0 error) for the AutoMate reward path on synthetic data.
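The variable-length handling described in observation 2 can be sketched in plain Python. This is only an illustration under assumptions (hypothetical function names, squared-difference cost, no bandwidth), not the body of `_soft_dtw_variable_y_no_grad`: padded columns are left at `inf` and the result is read at column `y_length` instead of the last column.

```python
import math


def soft_min3(a, b, c, gamma):
    """Soft minimum of three values; plain minimum when gamma == 0."""
    m = min(a, b, c)
    if gamma == 0 or m == math.inf:
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in (a, b, c)))


def soft_dtw_padded(x, y_padded, y_length, gamma):
    """SoftDTW of x against the first y_length entries of a padded reference."""
    n, m = len(x), len(y_padded)
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if j > y_length:
                continue  # padded column: stays masked at inf
            cost = (x[i - 1] - y_padded[j - 1]) ** 2
            R[i][j] = cost + soft_min3(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
    return R[n][y_length]  # index by the true length, not the padded one


x = [0.0, 0.5, 1.0]
refs = [[0.0, 1.0, 2.0, 3.0], [0.5, 1.5]]          # references of different lengths
max_len = max(len(r) for r in refs)
padded = [r + [0.0] * (max_len - len(r)) for r in refs]

batched = [soft_dtw_padded(x, p, len(r), 0.01) for p, r in zip(padded, refs)]
separate = [soft_dtw_padded(x, r, len(r), 0.01) for r in refs]
```

Because each cell only reads columns at or to the left of itself, the padding cannot influence the value read at `y_length`, so the padded and unpadded results match.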

Minor Notes (non-blocking)

  • The _soft_dtw_autograd function uses nested Python loops (for i ... for j ...). This is fine for the reward-only use case (small sequences, rare autograd calls) but would be slow for large sequences requiring gradients. The commit message confirms this tradeoff is intentional.

Verdict

✅ LGTM — Clean cherry-pick with appropriate beta2 conflict resolution. The PyTorch implementation is correct, well-tested, and properly removes the Numba dependency that causes failures on newer GPUs (RTX 5090 PTX version errors).


Update (commit 407112d): New commits add three separate fixes to this beta2 PR:

  1. Environment destructor fix (direct_rl_env.py, direct_marl_env.py, manager_based_env.py): Prevents __del__ from emitting tracebacks during Python shutdown. Uses the standard pattern of capturing sys as a default arg and checking sys.meta_path is not None. Also guards against re-entry when _is_closed is already True. ✅ Correct.
  2. AutoMate collision stack (assembly_env_cfg.py, disassembly_env_cfg.py): Adds gpu_collision_stack_size=2**27 to avoid dropped contacts at 128 envs. ✅ Reasonable config change.
  3. Run helper placeholder guard (run_w_id.py, run_disassembly_w_id.py): Rejects literal ASSEMBLY_ID placeholder before launching simulation. ✅ Good UX improvement.
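The destructor pattern named in item 1 can be sketched as follows. `Env` and `close_calls` are hypothetical stand-ins for illustration, not Isaac Lab classes; the sketch only shows the two guards the update describes (the `sys` default argument with the `sys.meta_path` check, and the `_is_closed` re-entry check).

```python
import sys


class Env:
    """Hypothetical environment stand-in; not Isaac Lab's actual class."""

    def __init__(self):
        self._is_closed = False
        self.close_calls = 0

    def close(self):
        if self._is_closed:
            return  # already closed: nothing to do
        self._is_closed = True
        self.close_calls += 1

    def __del__(self, sys=sys):
        # `sys` is captured as a default argument so it remains reachable
        # even if module globals are being cleared at interpreter exit.
        if sys.meta_path is None:
            return  # interpreter is shutting down: skip cleanup quietly
        if getattr(self, "_is_closed", True):
            return  # closed already, or __init__ never finished
        self.close()


env = Env()
env.close()
env.__del__()       # explicit call for demonstration; must not close twice

other = Env()
other.__del__()     # not yet closed, so the destructor closes it once
```

The `getattr` default of `True` is a design choice in this sketch: it makes the destructor a no-op if `__init__` raised before `_is_closed` was set.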

All new changes are clean. Previous inline comments (autograd cell loop performance, pytest tensor comparison) remain open — they were not addressed in these commits but are non-blocking per the original review.

No new issues found.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This backport removes AutoMate's Numba dependency by replacing the Numba CUDA SoftDTW kernels with a pure-PyTorch implementation, and refactors get_imitation_reward_from_dtw to exploit a new forward_with_lengths API that batches variable-length reference trajectories in one call.

  • soft_dtw_cuda.py: New _soft_dtw_no_grad uses efficient anti-diagonal vectorisation (O(len_x + len_y) tensor ops); _soft_dtw_autograd falls back to a per-cell Python loop (O(len_x × len_y) iterations + one CopySlices graph node per cell) which may be noticeably slower during gradient-enabled training. forward_with_lengths is added for padded batch inputs.
  • automate_algo_utils.py: Environments are now batched by shared reference-trajectory length (or via forward_with_lengths), replacing the previous per-environment serial DTW calls.
  • test_automate_soft_dtw.py: Adds unit tests for the new implementation; the test suite was not executed (torch not available on the author's machine), and one assertion uses an undocumented pytest × PyTorch comparison pattern.
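As a rough illustration of the "batched by shared reference-trajectory length" fallback, the sketch below groups environment indices by length so that each group could be passed to one batched call. The helper names are hypothetical; the reward mapping `1 - tanh(soft_dtw)` is taken from the flowchart in this summary, and everything else is an assumption.

```python
import math
from collections import defaultdict


def group_by_length(lengths):
    """Map each reference length to the environment indices that share it."""
    groups = defaultdict(list)
    for env_id, length in enumerate(lengths):
        groups[length].append(env_id)
    return dict(groups)


def imitation_reward(soft_dtw_value):
    """Reward mapping from the flowchart: 1 - tanh(soft_dtw)."""
    return 1.0 - math.tanh(soft_dtw_value)


groups = group_by_length([100, 80, 100, 80, 60])
perfect = imitation_reward(0.0)
distant = imitation_reward(5.0)
```

With five environments and three distinct lengths, this needs three batched calls instead of five per-environment calls.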

Confidence Score: 4/5

Safe to merge for dependency removal; the new PyTorch DP is functionally correct for the reward computation path, but the autograd path has an untested performance concern and minor test quality gaps worth tracking.

The reward-computation path (no-grad, inference) is well-vectorised and logically correct. The autograd path uses O(len_x × len_y) Python iterations with per-cell CopySlices nodes — structurally sound but potentially very slow for long sequences under training. The test suite was not executed due to missing torch, leaving the backward-pass behaviour unconfirmed. Two test quality issues (fragile pytest.approx comparison, CUDA tests running on CPU) reduce confidence slightly, but none of these affect correctness of the core Numba removal.

soft_dtw_cuda.py (autograd path performance) and test_automate_soft_dtw.py (unrun tests, fragile assertion).

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py | Replaces Numba CUDA kernels with pure PyTorch DP; the no-grad path is well-vectorised (anti-diagonal batching), but the autograd path falls back to O(len_x × len_y) per-cell Python loops which will be significantly slower during gradient-enabled training. |
| source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py | Refactors per-environment DTW loop to batch calls via new forward_with_lengths API or by grouping environments sharing the same reference trajectory length; logic and shapes look correct. |
| source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py | Removes the NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS env-var injection and passes env=None (inherited) to subprocess; clean and correct. |
| source/isaaclab_tasks/setup.py | Drops numba>=0.63.1 from install_requires; no other changes. |
| source/isaaclab_tasks/test/test_automate_soft_dtw.py | New test file covering basic SoftDTW behaviour, forward_with_lengths, and backward pass; has a fragile pytest.approx(torch.tensor(...)) comparison, and CUDA-tagged tests run on CPU without a device guard, leaving GPU behaviour uncovered. Tests were not run due to missing torch on the author's system. |
| source/isaaclab_tasks/changelog.d/fix-automate-numba-constraints.rst | Changelog entry for the Numba removal; accurate and complete. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["SoftDTW.forward(X, Y)"] --> B{normalize?}
    B -- yes --> C["Stack X+Y for 3 DTW calls\nthen combine outputs"]
    B -- no --> D["D = dist_func(X, Y)"]
    C --> E["_soft_dtw(D, gamma, bandwidth)"]
    D --> E
    E --> F{grad_enabled AND\nD.requires_grad?}
    F -- yes --> G["_soft_dtw_autograd\n(cell-by-cell loop,\nO(len_x x len_y) iterations)"]
    F -- no --> H["_soft_dtw_no_grad\n(anti-diagonal vectorised,\nO(len_x + len_y) iterations)"]
    G --> I["return R[:, len_x, len_y]"]
    H --> I
    A2["SoftDTW.forward_with_lengths\n(X, Y, y_lengths)"] --> J{grad_enabled AND\nrequires_grad?}
    J -- yes --> K["Per-sample loop\ncalling forward()"]
    J -- no --> L["_soft_dtw_variable_y_no_grad\n(anti-diagonal + length masking)"]
    K --> M["torch.cat outputs"]
    L --> M
    N["get_imitation_reward_from_dtw"] --> O{criterion has\nforward_with_lengths\nAND normalize=False?}
    O -- yes --> P["Pad ref trajs to max_len\ncall forward_with_lengths"]
    O -- no --> Q["Group envs by traj_len\nbatch call forward() per group"]
    P --> R["imitation_rwd = 1 - tanh(soft_dtw)"]
    Q --> R

Comments Outside Diff (2)

  1. source/isaaclab_tasks/test/test_automate_soft_dtw.py, line 701-708 (link)

    P2 pytest.approx comparison against a torch.Tensor is unreliable

    pytest.approx(torch.tensor([1.0])) works by iterating the tensor as a Python sequence, so the comparison succeeds if the list representation matches approximately — but this behavior is undocumented and depends on how pytest inspects the argument. If criterion(x, y) returns a CUDA tensor or a tensor whose __eq__ short-circuits before pytest can inspect it, the assertion may pass vacuously or raise. Using torch.allclose or extracting a Python scalar (.item()) is the idiomatic and reliable pattern here.

  2. source/isaaclab_tasks/test/test_automate_soft_dtw.py, line 691-717 (link)

    P2 CUDA-tagged tests run on CPU, leaving device-specific behaviour untested

    test_soft_dtw_use_cuda_does_not_require_numba and test_normalized_soft_dtw_identical_sequences_are_zero both construct SoftDTW(use_cuda=True, device="cuda", ...) but pass plain CPU tensors. Because use_cuda and device are now ignored, the tests happen to pass without a GPU, but they provide zero coverage of a) mixed-device errors, b) numeric fidelity on CUDA, and c) any future path that reintroduces device routing. Adding a pytest.importorskip/skipif guard conditioned on torch.cuda.is_available() and moving the tensors to .cuda() inside those tests would make the intent explicit.

Reviews (1): Last reviewed commit: "Remove AutoMate Numba SoftDTW dependency..."

Comment on lines +32 to 57
+def _soft_dtw_autograd(D: torch.Tensor, gamma: float, bandwidth: float) -> torch.Tensor:
+    """Compute SoftDTW using Torch ops that preserve autograd."""
+    batch_size, len_x, len_y = D.shape
+    R = torch.full((batch_size, len_x + 2, len_y + 2), float("inf"), device=D.device, dtype=D.dtype)
+    R[:, 0, 0] = 0

+    band_size = int(bandwidth) if bandwidth > 0 else max(len_x, len_y)
+    for i in range(1, len_x + 1):
+        j_start = max(1, i - band_size)
+        j_end = min(len_y, i + band_size) + 1

-# ----------------------------------------------------------------------------------------------------------------------
-@cuda.jit
-def compute_softdtw_backward_cuda(D, R, inv_gamma, bandwidth, max_i, max_j, n_passes, E):
-    k = cuda.blockIdx.x
-    tid = cuda.threadIdx.x
+        for j in range(j_start, j_end):
+            r0 = R[:, i - 1, j - 1]
+            r1 = R[:, i - 1, j]
+            r2 = R[:, i, j - 1]

-    # Indexing logic is the same as above, however, the anti-diagonal needs to
-    # progress backwards
+            if gamma == 0:
+                softmin = torch.minimum(torch.minimum(r0, r1), r2)
+            else:
+                previous_costs = torch.stack((r0, r1, r2))
+                softmin = -gamma * torch.logsumexp(-previous_costs / gamma, dim=0)

-    for p in range(n_passes):
-        # Reverse the order to make the loop go backward
-        rev_p = n_passes - p - 1
+            R[:, i, j] = D[:, i - 1, j - 1] + softmin

-        # convert tid to I, J, then i, j
-        J = max(0, min(rev_p - tid, max_j - 1))
+    return R[:, len_x, len_y]


P2 _soft_dtw_autograd cell-by-cell loop vs. anti-diagonal vectorisation

_soft_dtw_no_grad processes entire anti-diagonals in one batched tensor op (O(len_x + len_y) iterations, each operating on a full diagonal slice). _soft_dtw_autograd loops over every individual (i, j) cell (O(len_x × len_y) Python iterations) and also creates a CopySlices autograd node per cell. For typical AutoMate trajectory lengths (e.g. 256 steps), that is ~65 k Python loop iterations and ~65 k graph nodes per forward pass, vs. ~511 iterations in the no-grad path. During training with gradient tracking this will be several orders of magnitude slower than the previous CUDA implementation, potentially dominating step time.
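The iteration counts quoted above follow from simple arithmetic, sketched here with a hypothetical helper. It counts sequential Python-level steps only (no bandwidth pruning) and says nothing about wall-clock time. The second call uses the `N=10`, `M=100` shape from the PR's own benchmarks for comparison.

```python
def loop_counts(len_x, len_y):
    """Sequential steps for a per-cell loop vs. an anti-diagonal sweep."""
    per_cell = len_x * len_y          # one Python iteration per (i, j) cell
    per_diagonal = len_x + len_y - 1  # one iteration per anti-diagonal
    return per_cell, per_diagonal


review_example = loop_counts(256, 256)   # lengths assumed in this comment
benchmark_shape = loop_counts(10, 100)   # shape used in the PR benchmarks
```

For 256-step sequences this gives 65,536 cells against 511 diagonals; for the benchmark shape it gives 1,000 cells against 109 diagonals.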

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

x = torch.tensor([[[0.0], [1.0]]])
y = torch.tensor([[[0.0], [2.0]]])

assert criterion(x, y) == pytest.approx(torch.tensor([1.0]))


P2 The comparison relies on undocumented pytest × PyTorch tensor interaction. Extracting a scalar via .item() and comparing against a Python float is explicit and version-proof.

Suggested change
- assert criterion(x, y) == pytest.approx(torch.tensor([1.0]))
+ assert criterion(x, y).item() == pytest.approx(1.0)


@AntoineRichard AntoineRichard merged commit fc6e8c8 into isaac-sim:release/3.0.0-beta2 Jun 9, 2026
37 checks passed
@ooctipus ooctipus deleted the fix/beta2-automate-softdtw-no-numba branch June 9, 2026 13:05

Labels

bug, isaac-lab
