Remove AutoMate Numba SoftDTW dependency by ooctipus · Pull Request #6040 · isaac-sim/IsaacLab

ooctipus · 2026-06-08T22:26:07Z

Summary

Remove numba from isaaclab_tasks dependencies.
Replace AutoMate's Numba CUDA/CPU-JIT SoftDTW helper with a Torch implementation that runs on the input tensor device.
Add a no-grad anti-diagonal SoftDTW path plus forward_with_lengths(...) so the AutoMate reward evaluates padded variable-length reference segments in one batched call instead of one SoftDTW call per environment.
Clean up the autograd SoftDTW path to use a Torch DP table instead of Python row lists, with a clearer docstring.
Remove the Numba CUDA warning environment variable from run_w_id.py.
Add focused SoftDTW tests for the no-Numba path, hard DTW, normalized SoftDTW, variable-length padded SoftDTW, and finite backward gradients.
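
The anti-diagonal idea mentioned above can be illustrated without Torch. In the SoftDTW recurrence, cell (i, j) reads only (i-1, j-1), (i-1, j), and (i, j-1), so all cells sharing the same i + j are independent of each other and can be filled in one step. The sketch below is plain Python, not the PR's batched tensor code. The function names are hypothetical and gamma is assumed to be greater than 0. It only checks that visiting cells by anti-diagonal reproduces the row-by-row result.

```python
import math


def softmin3(a, b, c, gamma):
    """Soft minimum of three DP predecessors, assuming gamma > 0."""
    scaled = [-a / gamma, -b / gamma, -c / gamma]
    top = max(scaled)
    # Shift by the largest term so the exponentials cannot overflow.
    return -gamma * (top + math.log(sum(math.exp(s - top) for s in scaled)))


def soft_dtw_rows(D, gamma):
    """Reference DP: fill the table row by row."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + softmin3(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
    return R[n][m]


def soft_dtw_antidiagonals(D, gamma):
    """Same recurrence, visiting cells grouped by anti-diagonal k = i + j."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for k in range(2, n + m + 1):
        # Cells on diagonal k read only diagonals k-1 and k-2, so a tensor
        # implementation could update this whole inner loop in one operation.
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            R[i][j] = D[i - 1][j - 1] + softmin3(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
    return R[n][m]


D = [[1.0, 2.0, 3.0], [2.0, 0.5, 1.0]]
print(soft_dtw_rows(D, 0.1), soft_dtw_antidiagonals(D, 0.1))
```

The number of sequential steps drops from len_x * len_y cells to len_x + len_y - 1 diagonals, which is the property a batched GPU implementation can exploit.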

Rationale

A NumPy pin is not a sustainable fix for the original failure. AutoMate only needs SoftDTW values for reward computation; it does not require the copied differentiable Numba implementation as a package-level dependency. Keeping Numba also exposes a second failure mode on RTX 5090: the old Numba CUDA kernel can fail at compile time with CUDA_ERROR_UNSUPPORTED_PTX_VERSION (unsupported PTX version).

This removes the dependency instead of constraining global NumPy resolution.

Verification

Focused tests pass in the develop venv: python -m pytest source/isaaclab_tasks/test/contrib/test_automate_soft_dtw.py -q (5 passed).
Focused tests pass in the beta2 venv where Numba import is broken (5 passed).
git diff --check passes.
py_compile passes for the touched Python files.
Old-vs-new SoftDTW CPU forward parity: 594 finite cases across gamma={0.01,0.1,1.0}, normalized and non-normalized valid cases, bandwidth {None,2,20}, and shapes up to B=8, N=10, M=100; max absolute difference was 1.526e-04.
Mustafa's row/column Torch DP variant matched the current no-grad implementation exactly in direct forward checks; for B=128,N=10,M=100, it measured 82.497 ms on CUDA versus 14.038 ms for the anti-diagonal no-grad SoftDTW path, so this PR keeps the anti-diagonal path for reward inference while using the cleaner Torch DP-table style for autograd.
For gamma=0, the old implementation returns nan on a simple hard-DTW case; the new implementation returns the expected hard-DTW value 1.0.
New SoftDTW autograd smoke test produces finite gradients.
AutoMate reward parity: optimized length-aware reward path matches the original per-env reward loop on synthetic AutoMate-shaped data (128 envs, 10 robot waypoints, ref_len=100, gamma=0.01); max absolute error was 0.0 on CPU and CUDA.
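
The gamma=0 item above comes down to the soft minimum. The smoothed formula divides by gamma, so the hard-DTW limit has to be handled as its own branch. The sketch below is a minimal illustration in plain Python with a hypothetical `softmin` helper. It does not claim to reproduce the mechanism by which the old implementation returned nan, which the PR does not describe.

```python
import math


def softmin(values, gamma):
    """Smoothed minimum used by SoftDTW; gamma == 0 selects the hard minimum."""
    if gamma == 0:
        # The smoothed formula divides by gamma, so the limit is taken explicitly.
        return min(values)
    scaled = [-v / gamma for v in values]
    top = max(scaled)
    # Shift by the largest term so the exponentials cannot overflow.
    return -gamma * (top + math.log(sum(math.exp(s - top) for s in scaled)))


candidates = [1.0, 2.0, 3.0]
for gamma in (1.0, 0.1, 0.01, 0.0):
    print(gamma, softmin(candidates, gamma))
```

As gamma shrinks, the soft minimum approaches the hard minimum from below, so the explicit branch at gamma == 0 is the continuous limit of the smoothed formula.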

Performance

Synthetic AutoMate-shaped reward benchmark on RTX 5090 with Torch 2.10.0+cu128, 128 envs, 10 robot waypoints, ref_len=100, gamma=0.01, no_grad:

Path	CPU median	CUDA median	CUDA peak allocated delta
Per-env Torch reward loop	`141.617 ms`	`355.131 ms`	`15.372 MB`
Batched length-aware Torch reward	`13.483 ms`	`25.824 ms`	`15.372 MB`

The peak CUDA allocation in this reward benchmark is dominated by the closest-state torch.cdist calculation, not by the SoftDTW table.
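
A rough size estimate is consistent with that statement. The sketch below assumes a float32 DP table of shape (batch, len_x + 2, len_y + 2), which is the layout the review describes for the autograd path. The no-grad path may lay out its table differently, so treat the figure as an order-of-magnitude check rather than a measurement.

```python
# Benchmark shape from the PR description: 128 envs, 10 waypoints, ref_len=100.
batch, len_x, len_y = 128, 10, 100
bytes_per_element = 4  # assumption: float32

# Assumed table layout: (batch, len_x + 2, len_y + 2).
table_bytes = batch * (len_x + 2) * (len_y + 2) * bytes_per_element
table_mb = table_bytes / 1e6

reported_peak_mb = 15.372  # peak allocated delta from the table above
print(f"DP table: {table_bytes} bytes ({table_mb:.3f} MB)")
print(f"Share of reported peak: {table_mb / reported_peak_mb:.1%}")
```

Under these assumptions the table is well under 1 MB, a small fraction of the reported peak.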

The previous Numba CUDA path could not be timed on this RTX 5090 because it fails locally with CUDA_ERROR_UNSUPPORTED_PTX_VERSION; the performance comparison above is against the direct per-env Torch replacement path that this PR would otherwise have used.
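
The speedups implied by the medians in the table can be computed directly. This is plain arithmetic on the reported numbers, not a new measurement.

```python
# Median timings in milliseconds from the benchmark table: (per-env loop, batched).
timings_ms = {
    "cpu": (141.617, 13.483),
    "cuda": (355.131, 25.824),
}

speedups = {device: loop / batched for device, (loop, batched) in timings_ms.items()}

for device, ratio in speedups.items():
    print(f"{device}: {ratio:.2f}x faster with the batched length-aware path")
```

The batched path comes out at about 10.5x faster on CPU and about 13.8x faster on CUDA.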

greptile-apps · 2026-06-08T22:28:32Z

Greptile Summary

This PR constrains the NumPy and Numba dependency bounds to prevent incompatible pre-release stacks (numba==0.66.0rc1 + numpy==2.5.0rc1) from being resolved under prerelease = "allow", which was breaking soft_dtw_cuda.py CUDA imports in AutoMate training.

The root pyproject.toml uv workspace override is tightened from numpy>=2 to numpy>=2,<2.5, blocking NumPy 2.5.x project-wide in the workspace.
source/isaaclab_tasks/pyproject.toml pins Numba to <0.66, keeping the resolved stack at numba==0.65.1 + numpy==2.4.6.
A changelog fragment is added under isaaclab_tasks/changelog.d/ in the correct RST format.

Confidence Score: 4/5

The workspace fix is sound, but the numpy upper bound lives only in the uv workspace override and not in the package's own metadata, leaving standalone installs unprotected.

The workspace-level numpy cap and the numba <0.66 bound together correctly block the broken pre-release stack for anyone using the workspace. The gap is that isaaclab_tasks/pyproject.toml still advertises numpy>=2 without an upper bound, so a direct non-workspace install can still resolve the incompatible combination. The changelog fragment and overall approach are correct.

source/isaaclab_tasks/pyproject.toml — the numpy>=2 dependency line is missing the <2.5 upper bound that was added to the workspace override.

Important Files Changed

Filename	Overview
pyproject.toml	Root uv workspace override tightened from `numpy>=2` to `numpy>=2,<2.5` to block pre-release NumPy 2.5 that Numba 0.65 rejects; effective for workspace installs only.
source/isaaclab_tasks/pyproject.toml	Numba capped at `<0.66` to pin to the 0.65 line, but `numpy>=2` is still unbounded, leaving standalone installs unprotected from the NumPy 2.5 incompatibility.
source/isaaclab_tasks/changelog.d/fix-automate-numba-constraints.rst	New changelog fragment added in the correct `changelog.d/` directory with the correct RST section structure matching existing entries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["uv resolve\n(prerelease=allow)"] --> B{Root override-dependencies}
    B -->|"Before: numpy>=2"| C["numpy==2.5.0rc1\nnumba==0.66.0rc1\nllvmlite==0.48.0rc1"]
    B -->|"After: numpy>=2,<2.5"| D["numpy==2.4.6\nnumba==0.65.1\nnumba-cuda==0.30.2"]
    C --> E["❌ import numba.cuda FAILS\n(Numba rejects NumPy 2.5 RC)"]
    D --> F["✅ import numba.cuda OK\nsoft_dtw_cuda.py works"]
    G["isaaclab_tasks/pyproject.toml\nnumba>=0.63.1,<0.66"] --> D
    H["⚠️ numpy>=2 (no upper bound)\nin isaaclab_tasks pyproject"] -.->|standalone install| C

Comments Outside Diff (1)

source/isaaclab_tasks/pyproject.toml, line 20 (link)

The numpy>=2 lower-bound-only constraint in the package's own pyproject.toml is not paired with the <2.5 upper bound that was added to the root uv workspace override. A user who installs isaaclab_tasks directly (e.g. pip install isaaclab-tasks or uv add isaaclab-tasks outside this workspace) will still be able to resolve numba==0.65.x together with numpy==2.5.x, recreating exactly the broken stack this PR is fixing. The numpy cap should live in the package metadata, not only in the workspace [tool.uv] override.

Reviews (1): Last reviewed commit: "Constrain AutoMate Numba dependency stac..."

isaaclab-review-bot

🤖 Isaac Lab Review Bot

PR #6040 — Remove AutoMate Numba SoftDTW dependency

Summary

This PR adds upper-bound constraints to prevent uv from resolving pre-release Numba/NumPy stacks that break numba.cuda imports (used by AutoMate SoftDTW).

Changes Reviewed

File	Change
`pyproject.toml`	Root uv override: `numpy>=2` → `numpy>=2,<2.5`
`source/isaaclab_tasks/pyproject.toml`	`numba>=0.63.1` → `numba>=0.63.1,<0.66`
`changelog.d/...rst`	New changelog fragment

Findings

✅ Root cause is well-explained — The prerelease = "allow" setting combined with the unbounded numpy>=2 override was pulling in RC versions of numba, numpy, and llvmlite that are incompatible.
✅ Constraint approach is correct — Pinning numpy<2.5 at the override level and numba<0.66 at the package level keeps the resolver on the stable 0.65.x line while still allowing future patch releases.
⚠️ Minor: Upper bounds will need maintenance — When Numba 0.66 and NumPy 2.5 are officially released (non-RC), these caps should be revisited. Consider adding a comment in pyproject.toml explaining why the cap exists (e.g., # Cap to avoid pre-release stacks; bump when numba 0.66 stable is verified).
✅ Changelog fragment present — Follows project conventions (changelog.d/ with .rst format, proper Fixed category).
✅ CI checks passing — Pre-commit, license-check, wheel build, changelog check, and labeler all pass. Docker/installation tests still pending but unrelated to dependency metadata.

Verdict

👍 LGTM — Clean, minimal fix with good root-cause analysis. The constraints are appropriately scoped and the resolver dry-run verification is convincing. Only suggestion is to add inline comments for future maintainers explaining the upper bounds.

Automated review by isaaclab-review-bot • SHA: 8d47fb7

📝 Update (d0e0cac5) — Approach Changed: Numba Removed Entirely

The PR direction has shifted significantly. Instead of constraining the Numba/NumPy version stack, this commit removes the Numba dependency entirely by rewriting the SoftDTW implementation in pure PyTorch.

New Changes

File	Change
`soft_dtw_cuda.py`	Replaced ~350 lines of Numba CUDA/CPU kernels + `torch.autograd.Function` with a ~30-line pure-Torch `_soft_dtw()` function using `torch.logsumexp`
`source/isaaclab_tasks/pyproject.toml`	Removed `numba>=0.63.1,<0.66` dependency entirely
`pyproject.toml`	Reverted numpy override back to `numpy>=2` (upper bound no longer needed)
`run_w_id.py`	Removed `NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS=0` env var and `os` import
`changelog.d/...rst`	Updated to reflect the new approach
`test/contrib/test_automate_soft_dtw.py`	New — Unit tests verifying the Torch implementation works without Numba

Findings on New Approach

✅ Root cause eliminated — Rather than constraining a fragile dependency chain, Numba is gone. No more numba/llvmlite/numpy version compatibility issues. This is a stronger fix.
✅ API preserved — SoftDTW class retains the same constructor signature (use_cuda, device, gamma, normalize, bandwidth, dist_func). Existing callers will not break.
⚠️ Performance consideration — The new implementation uses Python-level loops over the DTW matrix (for i in range(1, len_x + 1): for j in range(1, len_y + 1):). For short sequences this is fine, but for long sequences this will be significantly slower than the previous Numba CUDA kernel. Worth noting in the changelog or docstring if large sequences are expected.
⚠️ No autograd backward — The previous implementation had explicit backward() via torch.autograd.Function. The new _soft_dtw() relies on PyTorch's autograd graph through the standard ops (logsumexp, stack, tensor arithmetic). This should work correctly but may use more memory for long sequences due to the computational graph.
✅ Tests added — Three test cases covering: no-Numba requirement, hard DTW correctness (gamma=0), and normalized mode (identical sequences → 0). Good coverage for a replacement.
✅ gamma=0 handled — Falls back to torch.minimum for hard DTW, avoiding division by zero.

Updated Verdict

👍 LGTM — This is a better approach than version pinning. Eliminating Numba removes the fragile dependency chain entirely and simplifies the build. The pure-Torch implementation is clean, well-tested, and API-compatible. The only trade-off is potential performance regression for very long sequences, which is acceptable given AutoMate's typical use case.

Previous concern about "upper bounds needing maintenance" is now resolved — there are no upper bounds to maintain.

Updated review by isaaclab-review-bot • SHA: d0e0cac5

📝 Update (1b0e2424) — Performance Optimizations for Batched SoftDTW

This commit addresses the performance concerns raised in the previous review by adding optimized inference paths and batched variable-length SoftDTW computation.

New Changes

File	Change
`soft_dtw_cuda.py`	Added `_soft_dtw_no_grad()` (anti-diagonal batched DP), `_soft_dtw_variable_y_no_grad()` (variable-length Y), and `forward_with_lengths()` method
`automate_algo_utils.py`	Refactored reward computation to batch SoftDTW calls instead of per-env loop
`test_automate_soft_dtw.py`	Added test for `forward_with_lengths()` correctness

Findings on Performance Optimizations

✅ Anti-diagonal DP path — _soft_dtw_no_grad() evaluates each anti-diagonal of the DP matrix in one batched operation, avoiding Python row/column loops. This is a significant optimization for inference without gradients.
✅ Smart dispatch — The new _soft_dtw() wrapper checks torch.is_grad_enabled() and requires_grad to choose between the autograd-preserving path and the faster no-grad path. Clean separation of concerns.
✅ Variable-length batching — _soft_dtw_variable_y_no_grad() handles padded Y sequences with different lengths by masking invalid positions with inf. This enables batching across environments with different reference trajectory lengths.
✅ Reward function batching — get_imitation_reward_from_dtw() now collects selected trajectories, pads them to max length, and calls forward_with_lengths() in one batched operation (when available and normalize=False). Falls back to grouping by length otherwise.
✅ Test coverage — test_soft_dtw_forward_with_lengths_matches_unpadded_calls() verifies the batched path produces identical results to per-element calls. Good regression test.
⚠️ Minor: Type annotations — selected_trajs_by_len and env_ids_by_len use generic dict[int, list[...]] syntax which requires Python 3.9+. IsaacLab targets 3.10+ so this is fine, but worth noting.
✅ Explicit int() casts — Tensor indexing now uses int(min_dist_traj_idx[i].item()) etc., avoiding potential type warnings with newer PyTorch versions.
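
The variable-length batching described above relies on one property of the recurrence: a cell depends only on cells with smaller indices, so padding the reference axis with inf cannot change the values inside the valid region. The sketch below shows this in plain Python. The helper names are hypothetical, gamma is assumed to be greater than 0, and reading each result at its true length is an assumption about how such a path can work, not a copy of `_soft_dtw_variable_y_no_grad()`.

```python
import math


def softmin3(a, b, c, gamma):
    """Soft minimum of three DP predecessors, assuming gamma > 0."""
    scaled = [-a / gamma, -b / gamma, -c / gamma]
    top = max(scaled)
    if top == -math.inf:  # every predecessor is unreachable (padded region)
        return math.inf
    return -gamma * (top + math.log(sum(math.exp(s - top) for s in scaled)))


def soft_dtw_table(D, gamma):
    """Fill and return the full SoftDTW table for one distance matrix."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + softmin3(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
    return R


def pad_columns(D, width):
    """Pad each row with inf so all matrices share the same width."""
    return [row + [math.inf] * (width - len(row)) for row in D]


def soft_dtw_padded(batch, gamma):
    """Run the DP on padded matrices and read each result at its true length."""
    lengths = [len(D[0]) for D in batch]
    width = max(lengths)
    results = []
    for D, length in zip(batch, lengths):
        R = soft_dtw_table(pad_columns(D, width), gamma)
        results.append(R[len(D)][length])  # ignore the padded columns
    return results


batch = [
    [[1.0, 2.0], [0.5, 1.5]],
    [[1.0, 2.0, 0.5, 3.0], [2.0, 0.1, 1.0, 0.7]],
]
print(soft_dtw_padded(batch, 0.1))
```

Because every matrix now has the same shape, a tensor implementation can stack them and run a single DP over the batch instead of one call per environment.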

Previous Concerns Addressed

Performance concern ✅: The no-grad anti-diagonal path addresses the Python loop overhead for inference. By the medians in the PR description, the batched path is ~10.5x faster on CPU (141.617 ms vs 13.483 ms) and ~13.8x faster on CUDA (355.131 ms vs 25.824 ms) than the per-env loop.
Memory concern ✅: The no-grad path uses a single R tensor instead of building an autograd graph, reducing memory pressure during inference.

Updated Verdict

👍 LGTM — This commit delivers on the performance optimization promise. The anti-diagonal batched DP is a well-known technique for GPU-friendly DTW, and the variable-length batching is a clean solution for AutoMate's per-env trajectory selection. All concerns from previous review have been addressed.

Updated review by isaaclab-review-bot • SHA: 1b0e2424

📝 Update (02fe40b7) — Autograd Path Rewritten with Proper Tensor Storage + Gradient Test

Changes in This Commit

File	Change
`soft_dtw_cuda.py`	Rewrote `_soft_dtw_autograd()` to use a 3D `R` tensor instead of Python lists; added bandwidth banding to autograd path; style fixes (`if 0 < bandwidth` → `if bandwidth > 0`); added proper docstring to `_soft_dtw()`
`test_automate_soft_dtw.py`	Added `test_soft_dtw_backward_produces_finite_gradients()` test; cosmetic path simplification

Findings

✅ Autograd path now uses proper tensor indexing — The previous prev_row/curr_row list-of-tensors approach is replaced with a single R tensor of shape (batch, len_x+2, len_y+2). This is cleaner and avoids potential issues with Python list references during backprop graph construction.
✅ Bandwidth banding added to autograd path — The autograd path now pre-computes j_start/j_end based on band_size instead of using continue inside the inner loop. This skips unnecessary iterations entirely, matching the optimization already present in the no-grad path.
✅ Gradient test added — test_soft_dtw_backward_produces_finite_gradients() verifies that backward() through the autograd path produces non-None, finite gradients. This directly validates the concern raised in the earlier review about autograd correctness.
✅ Style consistency — if 0 < bandwidth: → if bandwidth > 0: is a minor readability improvement applied consistently across both no-grad functions.
✅ Docstring added — _soft_dtw() now has proper argument documentation explaining the gamma and bandwidth parameters.
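
The band pre-computation mentioned above can be sketched as a small helper. The `j_start`/`j_end` arithmetic follows the review suggestion quoted later in this thread. The function name `band_columns` is hypothetical, and a bandwidth less than or equal to 0 is taken to mean no band constraint, as in that suggestion.

```python
def band_columns(i, len_y, bandwidth):
    """Return the half-open column range [j_start, j_end) visited for row i."""
    if bandwidth <= 0:
        # No Sakoe-Chiba constraint: visit every column.
        return 1, len_y + 1
    band = int(bandwidth)
    j_start = max(1, i - band)
    j_end = min(len_y, i + band) + 1
    return j_start, j_end


for row in (1, 5, 10):
    print(row, band_columns(row, len_y=100, bandwidth=2))
```

With this arithmetic, a band narrower than the difference between the two sequence lengths never reaches the final cell. For len_x=10, len_y=100, and bandwidth=2, row 10 stops at column 12, so the value read at (len_x, len_y) stays at its inf initialization.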

Previous Concerns Addressed

"No autograd backward" concern ✅ — Now explicitly tested with gradient finiteness assertion.
Memory for autograd path — The R tensor approach pre-allocates the full DP table, which uses slightly more memory than the row-by-row list approach, but provides cleaner autograd graph construction. Acceptable trade-off for correctness.

Updated Verdict

👍 LGTM — Solid incremental improvement. The autograd path is now more robust with proper tensor storage, bandwidth optimization, and an explicit gradient test. All previously raised concerns have been addressed across the PR's evolution.

Updated review by isaaclab-review-bot • SHA: 02fe40b7

StafaH · 2026-06-09T01:40:47Z

+    return module
+
+
+def test_soft_dtw_use_cuda_does_not_require_numba():


Seems like this and the test below can be combined and parametrized over ('cpu', 'cuda').

StafaH · 2026-06-09T01:45:28Z

-        )
-        E = E[:, 1 : N + 1, 1 : M + 1]
-        return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None
+def _soft_dtw(D: torch.Tensor, gamma: float, bandwidth: float) -> torch.Tensor:


How about the below (using torch only instead of python lists, better docstring)?

def _soft_dtw(D: torch.Tensor, gamma: float, bandwidth: float = -1) -> torch.Tensor:
    """
    Compute batched SoftDTW from a pairwise distance tensor.

    D: Tensor of shape (batch, len_x, len_y)
    gamma: SoftDTW smoothing parameter. If gamma == 0, computes hard DTW.
    bandwidth: Optional Sakoe-Chiba bandwidth. If <= 0, no band constraint.
    """
    batch_size, len_x, len_y = D.shape
    inf = torch.tensor(float("inf"), device=D.device, dtype=D.dtype)

    R = torch.full(
        (batch_size, len_x + 2, len_y + 2),
        inf,
        device=D.device,
        dtype=D.dtype,
    )
    R[:, 0, 0] = 0

    use_band = bandwidth > 0
    bandwidth = int(bandwidth) if use_band else max(len_x, len_y)

    for i in range(1, len_x + 1):
        j_start = max(1, i - bandwidth)
        j_end = min(len_y, i + bandwidth) + 1
        for j in range(j_start, j_end):
            r0 = R[:, i - 1, j - 1]
            r1 = R[:, i - 1, j]
            r2 = R[:, i, j - 1]
            if gamma == 0:
                softmin = torch.minimum(torch.minimum(r0, r1), r2)
            else:
                softmin = -gamma * torch.logsumexp(
                    torch.stack((-r0 / gamma, -r1 / gamma, -r2 / gamma), dim=0),
                    dim=0,
                )
            R[:, i, j] = D[:, i - 1, j - 1] + softmin

    return R[:, len_x, len_y]

Backports #6040 to release/3.0.0-beta2.

Beta2-specific conflict resolution:
- kept beta2's existing pyproject.toml packaging shape
- removed numba from source/isaaclab_tasks/setup.py, where beta2 still declares install_requires
- placed the new SoftDTW test under source/isaaclab_tasks/test and updated it to load isaaclab_tasks/direct/automate/soft_dtw_cuda.py

Validation:
- git diff --check refs/remotes/upstream/release/3.0.0-beta2...HEAD
- python3 -m py_compile source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py source/isaaclab_tasks/test/test_automate_soft_dtw.py source/isaaclab_tasks/setup.py
- python3 -m pytest source/isaaclab_tasks/test/test_automate_soft_dtw.py -q could not run locally because the active system Python does not have torch installed

Constrain AutoMate Numba dependency stack

8d47fb7

github-actions Bot added the bug, isaac-lab, and infrastructure labels Jun 8, 2026

isaaclab-review-bot Bot reviewed Jun 8, 2026


Remove AutoMate Numba SoftDTW dependency

d0e0cac

ooctipus changed the title ~~Constrain AutoMate Numba dependency stack~~ Remove AutoMate Numba SoftDTW dependency Jun 9, 2026

StafaH approved these changes Jun 9, 2026


ooctipus added 3 commits June 8, 2026 18:49

Optimize AutoMate Torch SoftDTW reward

1b0e242

Clean up AutoMate SoftDTW DP path

f5ccd09

Run AutoMate SoftDTW formatter

02fe40b

ooctipus merged commit dc51341 into isaac-sim:develop Jun 9, 2026
37 checks passed

ooctipus mentioned this pull request Jun 9, 2026

[release/3.0.0-beta2] Remove AutoMate Numba SoftDTW dependency #6056

Merged

ooctipus deleted the fix/automate-numba-constraints branch June 9, 2026 03:20

ooctipus merged 5 commits into isaac-sim:develop from ooctipus:fix/automate-numba-constraints
