Remove AutoMate Numba SoftDTW dependency #6040
Greptile Summary
This PR constrains the NumPy and Numba dependency bounds to prevent incompatible pre-release stacks (
Confidence Score: 4/5
The workspace fix is sound, but the numpy upper bound lives only in the uv workspace override and not in the package's own metadata, leaving standalone installs unprotected. The workspace-level numpy cap and the numba [...] `source/isaaclab_tasks/pyproject.toml` [...]
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["uv resolve\n(prerelease=allow)"] --> B{Root override-dependencies}
B -->|"Before: numpy>=2"| C["numpy==2.5.0rc1\nnumba==0.66.0rc1\nllvmlite==0.48.0rc1"]
B -->|"After: numpy>=2,<2.5"| D["numpy==2.4.6\nnumba==0.65.1\nnumba-cuda==0.30.2"]
C --> E["❌ import numba.cuda FAILS\n(Numba rejects NumPy 2.5 RC)"]
D --> F["✅ import numba.cuda OK\nsoft_dtw_cuda.py works"]
G["isaaclab_tasks/pyproject.toml\nnumba>=0.63.1,<0.66"] --> D
H["⚠️ numpy>=2 (no upper bound)\nin isaaclab_tasks pyproject"] -.->|standalone install| C
🤖 Isaac Lab Review Bot
PR #6040 — Remove AutoMate Numba SoftDTW dependency
Summary
This PR adds upper-bound constraints to prevent uv from resolving pre-release Numba/NumPy stacks that break numba.cuda imports (used by AutoMate SoftDTW).
Changes Reviewed
| File | Change |
|---|---|
| `pyproject.toml` | Root uv override: `numpy>=2` → `numpy>=2,<2.5` |
| `source/isaaclab_tasks/pyproject.toml` | `numba>=0.63.1` → `numba>=0.63.1,<0.66` |
| `changelog.d/...rst` | New changelog fragment |
Findings
- ✅ Root cause is well-explained: the `prerelease = "allow"` setting combined with the unbounded `numpy>=2` override was pulling in RC versions of numba, numpy, and llvmlite that are incompatible.
- ✅ Constraint approach is correct: pinning `numpy<2.5` at the override level and `numba<0.66` at the package level keeps the resolver on the stable 0.65.x line while still allowing future patch releases.
- ⚠️ Minor: upper bounds will need maintenance. When Numba 0.66 and NumPy 2.5 are officially released (non-RC), these caps should be revisited. Consider adding a comment in `pyproject.toml` explaining why the cap exists (e.g., `# Cap to avoid pre-release stacks; bump when numba 0.66 stable is verified`).
- ✅ Changelog fragment present: follows project conventions (`changelog.d/` with `.rst` format, proper `Fixed` category).
- ✅ CI checks passing: pre-commit, license-check, wheel build, changelog check, and labeler all pass. Docker/installation tests are still pending but unrelated to dependency metadata.
Verdict
👍 LGTM — Clean, minimal fix with good root-cause analysis. The constraints are appropriately scoped and the resolver dry-run verification is convincing. Only suggestion is to add inline comments for future maintainers explaining the upper bounds.
Automated review by isaaclab-review-bot • SHA: 8d47fb7
📝 Update (d0e0cac5) — Approach Changed: Numba Removed Entirely
The PR direction has shifted significantly. Instead of constraining the Numba/NumPy version stack, this commit removes the Numba dependency entirely by rewriting the SoftDTW implementation in pure PyTorch.
New Changes
| File | Change |
|---|---|
| `soft_dtw_cuda.py` | Replaced ~350 lines of Numba CUDA/CPU kernels + `torch.autograd.Function` with a ~30-line pure-Torch `_soft_dtw()` function using `torch.logsumexp` |
| `source/isaaclab_tasks/pyproject.toml` | Removed `numba>=0.63.1,<0.66` dependency entirely |
| `pyproject.toml` | Reverted numpy override back to `numpy>=2` (upper bound no longer needed) |
| `run_w_id.py` | Removed `NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS=0` env var and `os` import |
| `changelog.d/...rst` | Updated to reflect the new approach |
| `test/contrib/test_automate_soft_dtw.py` | New: unit tests verifying the Torch implementation works without Numba |
Findings on New Approach
- ✅ Root cause eliminated: rather than constraining a fragile dependency chain, Numba is gone. No more numba/llvmlite/numpy version compatibility issues. This is a stronger fix.
- ✅ API preserved: the `SoftDTW` class retains the same constructor signature (`use_cuda`, `device`, `gamma`, `normalize`, `bandwidth`, `dist_func`). Existing callers will not break.
- ⚠️ Performance consideration: the new implementation uses Python-level loops over the DTW matrix (`for i in range(1, len_x + 1): for j in range(1, len_y + 1):`). For short sequences this is fine, but for long sequences this will be significantly slower than the previous Numba CUDA kernel. Worth noting in the changelog or docstring if large sequences are expected.
- ⚠️ No autograd backward: the previous implementation had an explicit `backward()` via `torch.autograd.Function`. The new `_soft_dtw()` relies on PyTorch's autograd graph through the standard ops (`logsumexp`, `stack`, tensor arithmetic). This should work correctly but may use more memory for long sequences due to the computational graph.
- ✅ Tests added: three test cases covering the no-Numba requirement, hard DTW correctness (`gamma=0`), and normalized mode (identical sequences → 0). Good coverage for a replacement.
- ✅ `gamma=0` handled: falls back to `torch.minimum` for hard DTW, avoiding division by zero.
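The soft-minimum recurrence and the `gamma=0` fallback discussed above can be sketched in plain Python. This is an illustration only, not the PR's code: the names `softmin3` and `soft_dtw` are hypothetical, it works on a single unbatched distance matrix stored as nested lists, and it uses the `math` module where the PR uses `torch.logsumexp` and `torch.minimum`.

```python
import math


def softmin3(a: float, b: float, c: float, gamma: float) -> float:
    """Soft minimum of three values; hard minimum when gamma == 0."""
    m = min(a, b, c)
    if gamma == 0 or math.isinf(m):
        # gamma == 0 would divide by zero below; an all-inf input has no
        # finite minimum to shift by.
        return m
    # -gamma * log(sum(exp(-v / gamma))), shifted by m for numerical stability.
    total = sum(math.exp(-(v - m) / gamma) for v in (a, b, c))
    return m - gamma * math.log(total)


def soft_dtw(D: list[list[float]], gamma: float) -> float:
    """Fill the (n + 1) x (m + 1) table R with nested Python loops."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + softmin3(
                R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma
            )
    return R[n][m]


# Squared distances between x = [0, 1, 2] and y = [0, 1, 2].
D = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
hard = soft_dtw(D, gamma=0.0)   # hard DTW of identical sequences
soft = soft_dtw(D, gamma=1.0)   # soft value is never above the hard one
```

For identical sequences the hard value is zero while the soft value is negative, which is one reason a normalized mode is useful.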
Updated Verdict
👍 LGTM — This is a better approach than version pinning. Eliminating Numba removes the fragile dependency chain entirely and simplifies the build. The pure-Torch implementation is clean, well-tested, and API-compatible. The only trade-off is potential performance regression for very long sequences, which is acceptable given AutoMate's typical use case.
Previous concern about "upper bounds needing maintenance" is now resolved — there are no upper bounds to maintain.
Updated review by isaaclab-review-bot • SHA: d0e0cac5
📝 Update (1b0e2424) — Performance Optimizations for Batched SoftDTW
This commit addresses the performance concerns raised in the previous review by adding optimized inference paths and batched variable-length SoftDTW computation.
New Changes
| File | Change |
|---|---|
| `soft_dtw_cuda.py` | Added `_soft_dtw_no_grad()` (anti-diagonal batched DP), `_soft_dtw_variable_y_no_grad()` (variable-length Y), and `forward_with_lengths()` method |
| `automate_algo_utils.py` | Refactored reward computation to batch SoftDTW calls instead of per-env loop |
| `test_automate_soft_dtw.py` | Added test for `forward_with_lengths()` correctness |
Findings on Performance Optimizations
- ✅ Anti-diagonal DP path: `_soft_dtw_no_grad()` evaluates each anti-diagonal of the DP matrix in one batched operation, avoiding Python row/column loops. This is a significant optimization for inference without gradients.
- ✅ Smart dispatch: the new `_soft_dtw()` wrapper checks `torch.is_grad_enabled()` and `requires_grad` to choose between the autograd-preserving path and the faster no-grad path. Clean separation of concerns.
- ✅ Variable-length batching: `_soft_dtw_variable_y_no_grad()` handles padded Y sequences with different lengths by masking invalid positions with `inf`. This enables batching across environments with different reference trajectory lengths.
- ✅ Reward function batching: `get_imitation_reward_from_dtw()` now collects selected trajectories, pads them to max length, and calls `forward_with_lengths()` in one batched operation (when available and `normalize=False`). Falls back to grouping by length otherwise.
- ✅ Test coverage: `test_soft_dtw_forward_with_lengths_matches_unpadded_calls()` verifies the batched path produces identical results to per-element calls. Good regression test.
- ⚠️ Minor: type annotations. `selected_trajs_by_len` and `env_ids_by_len` use generic `dict[int, list[...]]` syntax, which requires Python 3.9+. IsaacLab targets 3.10+ so this is fine, but worth noting.
- ✅ Explicit `int()` casts: tensor indexing now uses `int(min_dist_traj_idx[i].item())` etc., avoiding potential type warnings with newer PyTorch versions.
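The anti-diagonal ordering mentioned in the findings above can be illustrated with a small plain-Python sketch. It is not the PR's `_soft_dtw_no_grad()`: the function names are hypothetical, it uses hard DTW on nested lists, and the inner list comprehension stands in for what would be a single batched tensor operation per anti-diagonal.

```python
import math


def antidiagonals(n: int, m: int):
    """Yield the 1-based cells (i, j) of an n x m table grouped by k = i + j."""
    for k in range(2, n + m + 1):
        i_lo, i_hi = max(1, k - m), min(n, k - 1)
        yield [(i, k - i) for i in range(i_lo, i_hi + 1)]


def dtw_row_major(D):
    """Reference: classic row-by-row, column-by-column fill."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R[n][m]


def dtw_wavefront(D):
    """Same recurrence, visited one anti-diagonal at a time."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for cells in antidiagonals(n, m):
        # Cells on diagonal k read only from diagonals k - 1 and k - 2,
        # so all of them can be computed before any of them is written.
        values = [
            D[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            for i, j in cells
        ]
        for (i, j), value in zip(cells, values):
            R[i][j] = value
    return R[n][m]


x, y = [0.0, 2.0, 4.0, 1.0], [1.0, 3.0, 0.0]
D = [[abs(a - b) for b in y] for a in x]
```

The number of sequential steps drops from `n * m` cells to `n + m - 1` anti-diagonals, which is why this ordering suits batched GPU execution.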
Previous Concerns Addressed
- Performance concern ✅: the no-grad anti-diagonal path addresses the Python loop overhead for inference. The PR description shows the batched path achieves ~13.8x speedup on CPU and ~13.7x speedup on CUDA compared to the per-env loop.
- Memory concern ✅: the no-grad path uses a single `R` tensor instead of building an autograd graph, reducing memory pressure during inference.
Updated Verdict
👍 LGTM — This commit delivers on the performance optimization promise. The anti-diagonal batched DP is a well-known technique for GPU-friendly DTW, and the variable-length batching is a clean solution for AutoMate's per-env trajectory selection. All concerns from previous review have been addressed.
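Why padded, variable-length references can share one table is easiest to see in a small plain-Python sketch. It is only an illustration under assumptions: `dtw_table`, `dtw_padded`, and `sq_dist` are hypothetical names, hard DTW is used without a band, and the PR's actual `forward_with_lengths()` works on batched tensors. The idea is that cell `(n, y_len)` depends only on columns up to `y_len`, so padded columns can be masked with `inf` and the result read at the true length.

```python
import math


def sq_dist(x, y):
    """Pairwise squared distances between two 1-D sequences."""
    return [[(a - b) ** 2 for b in y] for a in x]


def dtw_table(D):
    """Full hard-DTW table R with a one-cell border."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R


def dtw_padded(D_padded, y_len):
    """Mask columns at or beyond y_len with inf, then read R[n][y_len]."""
    masked = [
        [d if j < y_len else math.inf for j, d in enumerate(row)]
        for row in D_padded
    ]
    return dtw_table(masked)[len(masked)][y_len]


x = [0.0, 1.0, 2.0]
ys = [[0.0, 2.0], [0.0, 1.0, 1.0, 2.0]]          # references of different lengths
max_len = max(len(y) for y in ys)
padded = [y + [0.0] * (max_len - len(y)) for y in ys]

unpadded_results = [dtw_table(sq_dist(x, y))[len(x)][len(y)] for y in ys]
padded_results = [dtw_padded(sq_dist(x, p), len(y)) for p, y in zip(padded, ys)]
```

Comparing `padded_results` with `unpadded_results` mirrors the kind of check the review attributes to `test_soft_dtw_forward_with_lengths_matches_unpadded_calls()`.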
Updated review by isaaclab-review-bot • SHA: 1b0e2424
📝 Update (02fe40b7) — Autograd Path Rewritten with Proper Tensor Storage + Gradient Test
Changes in This Commit
| File | Change |
|---|---|
| `soft_dtw_cuda.py` | Rewrote `_soft_dtw_autograd()` to use a 3D `R` tensor instead of Python lists; added bandwidth banding to autograd path; style fixes (`if 0 < bandwidth` → `if bandwidth > 0`); added proper docstring to `_soft_dtw()` |
| `test_automate_soft_dtw.py` | Added `test_soft_dtw_backward_produces_finite_gradients()` test; cosmetic path simplification |
Findings
- ✅ Autograd path now uses proper tensor indexing: the previous `prev_row`/`curr_row` list-of-tensors approach is replaced with a single `R` tensor of shape `(batch, len_x+2, len_y+2)`. This is cleaner and avoids potential issues with Python list references during backprop graph construction.
- ✅ Bandwidth banding added to autograd path: the autograd path now pre-computes `j_start`/`j_end` based on `band_size` instead of using `continue` inside the inner loop. This skips unnecessary iterations entirely, matching the optimization already present in the no-grad path.
- ✅ Gradient test added: `test_soft_dtw_backward_produces_finite_gradients()` verifies that `backward()` through the autograd path produces non-None, finite gradients. This directly validates the concern raised in the earlier review about autograd correctness.
- ✅ Style consistency: `if 0 < bandwidth:` → `if bandwidth > 0:` is a minor readability improvement applied consistently across both no-grad functions.
- ✅ Docstring added: `_soft_dtw()` now has proper argument documentation explaining the `gamma` and `bandwidth` parameters.
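The banding change described above (pre-computed column bounds instead of `continue`) can be checked with a short plain-Python sketch. The function names are hypothetical, and the skip condition `abs(i - j) > bandwidth` is an assumption about what the earlier `continue` guarded; the review does not quote it.

```python
def band_cells_with_continue(len_x: int, len_y: int, bandwidth: int):
    """Visit every cell and skip those outside the band (assumed older style)."""
    cells = []
    for i in range(1, len_x + 1):
        for j in range(1, len_y + 1):
            if abs(i - j) > bandwidth:
                continue
            cells.append((i, j))
    return cells


def band_cells_with_bounds(len_x: int, len_y: int, bandwidth: int):
    """Iterate only over the in-band columns of each row."""
    cells = []
    for i in range(1, len_x + 1):
        j_start = max(1, i - bandwidth)
        j_end = min(len_y, i + bandwidth) + 1
        cells.extend((i, j) for j in range(j_start, j_end))
    return cells


shapes = [(10, 100, 2), (5, 5, 1), (3, 7, 20)]
same = all(
    band_cells_with_continue(*shape) == band_cells_with_bounds(*shape)
    for shape in shapes
)
```

Both variants visit the same cells in the same order; the bounded version avoids running the inner loop body for cells that would be skipped anyway.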
Previous Concerns Addressed
- "No autograd backward" concern ✅: now explicitly tested with a gradient finiteness assertion.
- Memory for autograd path: the `R` tensor approach pre-allocates the full DP table, which uses slightly more memory than the row-by-row list approach, but provides cleaner autograd graph construction. Acceptable trade-off for correctness.
Updated Verdict
👍 LGTM — Solid incremental improvement. The autograd path is now more robust with proper tensor storage, bandwidth optimization, and an explicit gradient test. All previously raised concerns have been addressed across the PR's evolution.
Updated review by isaaclab-review-bot • SHA: 02fe40b7
    return module


def test_soft_dtw_use_cuda_does_not_require_numba():
It seems this test and the one below can be combined into a single test parametrized over ('cpu', 'cuda').
        )
        E = E[:, 1 : N + 1, 1 : M + 1]
        return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None
def _soft_dtw(D: torch.Tensor, gamma: float, bandwidth: float) -> torch.Tensor:
How about the below (using torch only instead of python lists, better docstring)?
def _soft_dtw(D: torch.Tensor, gamma: float, bandwidth: float = -1) -> torch.Tensor:
    """
    Compute batched SoftDTW from a pairwise distance tensor.
    D: Tensor of shape (batch, len_x, len_y)
    gamma: SoftDTW smoothing parameter. If gamma == 0, computes hard DTW.
    bandwidth: Optional Sakoe-Chiba bandwidth. If <= 0, no band constraint.
    """
    batch_size, len_x, len_y = D.shape
    inf = torch.tensor(float("inf"), device=D.device, dtype=D.dtype)
    R = torch.full(
        (batch_size, len_x + 2, len_y + 2),
        inf,
        device=D.device,
        dtype=D.dtype,
    )
    R[:, 0, 0] = 0
    use_band = bandwidth > 0
    bandwidth = int(bandwidth) if use_band else max(len_x, len_y)
    for i in range(1, len_x + 1):
        j_start = max(1, i - bandwidth)
        j_end = min(len_y, i + bandwidth) + 1
        for j in range(j_start, j_end):
            r0 = R[:, i - 1, j - 1]
            r1 = R[:, i - 1, j]
            r2 = R[:, i, j - 1]
            if gamma == 0:
                softmin = torch.minimum(torch.minimum(r0, r1), r2)
            else:
                softmin = -gamma * torch.logsumexp(
                    torch.stack((-r0 / gamma, -r1 / gamma, -r2 / gamma), dim=0),
                    dim=0,
                )
            R[:, i, j] = D[:, i - 1, j - 1] + softmin
    return R[:, len_x, len_y]
Backports #6040 to release/3.0.0-beta2.

Beta2-specific conflict resolution:
- kept beta2's existing `pyproject.toml` packaging shape
- removed `numba` from `source/isaaclab_tasks/setup.py`, where beta2 still declares `install_requires`
- placed the new SoftDTW test under `source/isaaclab_tasks/test` and updated it to load `isaaclab_tasks/direct/automate/soft_dtw_cuda.py`

Validation:
- `git diff --check refs/remotes/upstream/release/3.0.0-beta2...HEAD`
- `python3 -m py_compile source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py source/isaaclab_tasks/test/test_automate_soft_dtw.py source/isaaclab_tasks/setup.py`
- `python3 -m pytest source/isaaclab_tasks/test/test_automate_soft_dtw.py -q` could not run locally because the active system Python does not have torch installed
Summary
- [...] `numba` from `isaaclab_tasks` dependencies.
- [...] `forward_with_lengths(...)` so the AutoMate reward evaluates padded variable-length reference segments in one batched call instead of one SoftDTW call per environment.
- [...] `run_w_id.py`.

Rationale
The original failure is not something a NumPy pin can solve sustainably. AutoMate only needs SoftDTW values for reward computation; it does not require the copied differentiable Numba implementation as a package-level dependency. Keeping Numba also exposes a second failure mode on RTX 5090: the old Numba CUDA kernel can fail at compile time with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION` / unsupported PTX version.

This removes the dependency instead of constraining global NumPy resolution.
Verification
- `python -m pytest source/isaaclab_tasks/test/contrib/test_automate_soft_dtw.py -q` (`5 passed`).
- [...] `5 passed`).
- `git diff --check` passes.
- `py_compile` passes for the touched Python files.
- [...] `gamma={0.01,0.1,1.0}`, normalized/non-normalized valid cases, bandwidth `{None,2,20}`, and sequence lengths up to `B=8`, `N=10`, `M=100`; max absolute difference was `1.526e-04`.
- [...] `B=128`, `N=10`, `M=100`, it measured `82.497 ms` on CUDA versus `14.038 ms` for the anti-diagonal no-grad SoftDTW path, so this PR keeps the anti-diagonal path for reward inference while using the cleaner Torch DP-table style for autograd.
- [...] `gamma=0`, the old implementation returns `nan` on a simple hard-DTW case; the new implementation returns the expected hard-DTW value `1.0`.
- [...] `128` envs, `10` robot waypoints, `ref_len=100`, `gamma=0.01`); max absolute error was `0.0` on CPU and CUDA.

Performance
Synthetic AutoMate-shaped reward benchmark on RTX 5090 with Torch `2.10.0+cu128`, `128` envs, `10` robot waypoints, `ref_len=100`, `gamma=0.01`, `no_grad`:

| [...] | [...] | [...] |
|---|---|---|
| `141.617 ms` | `355.131 ms` | `15.372 MB` |
| `13.483 ms` | `25.824 ms` | `15.372 MB` |

The peak CUDA allocation in this reward benchmark is dominated by the closest-state `torch.cdist` calculation, not by the SoftDTW table.

The previous Numba CUDA path could not be timed on this RTX 5090 because it fails locally with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`; the performance comparison above is against the direct per-env Torch replacement path that this PR would otherwise have used.