Remove AutoMate Numba SoftDTW dependency #6040

Merged
ooctipus merged 5 commits into isaac-sim:develop from ooctipus:fix/automate-numba-constraints
Jun 9, 2026

Conversation

@ooctipus ooctipus commented Jun 8, 2026

Collaborator

Summary

  • Remove numba from isaaclab_tasks dependencies.
  • Replace AutoMate's Numba CUDA/CPU-JIT SoftDTW helper with a Torch implementation that runs on the input tensor device.
  • Add a no-grad anti-diagonal SoftDTW path plus forward_with_lengths(...) so the AutoMate reward evaluates padded variable-length reference segments in one batched call instead of one SoftDTW call per environment.
  • Clean up the autograd SoftDTW path to use a Torch DP table instead of Python row lists, with a clearer docstring.
  • Remove the Numba CUDA warning environment variable from run_w_id.py.
  • Add focused SoftDTW tests for the no-Numba path, hard DTW, normalized SoftDTW, variable-length padded SoftDTW, and finite backward gradients.
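
The padded, length-aware evaluation in the third bullet works because of a property of the DTW recurrence: cell (i, j) depends only on cells with smaller or equal indices, so trailing padding columns cannot change the value at the true end column. Below is a minimal plain-Python sketch of that idea. It uses hard DTW (the gamma = 0 case) for brevity, the names `hard_dtw_table` and `dtw_with_length` are hypothetical, and it is not the PR's batched Torch implementation of forward_with_lengths(...), which may handle padding differently.

```python
INF = float("inf")


def hard_dtw_table(D):
    """Return the (n+1) x (m+1) accumulated-cost table for cost matrix D."""
    n, m = len(D), len(D[0])
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each cell reads only its upper-left, upper, and left neighbours.
            R[i][j] = D[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R


def dtw_with_length(D_padded, y_len):
    """Read the DTW value for the first y_len columns of a padded cost matrix."""
    R = hard_dtw_table(D_padded)
    return R[len(D_padded)][y_len]


unpadded = [[1.0, 2.0], [2.0, 1.0]]
padded = [row + [0.0, 0.0] for row in unpadded]  # two arbitrary padding columns

# The padded evaluation, read at column 2, matches the unpadded result.
print(dtw_with_length(padded, 2), hard_dtw_table(unpadded)[2][2])  # 2.0 2.0
```

Because the padding values are never read when computing column `y_len`, one table per padded item is enough, which is what makes a single batched call possible in principle.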

Rationale

A NumPy pin is not a sustainable fix for the original failure. AutoMate only needs SoftDTW values for reward computation; it does not require the copied differentiable Numba implementation as a package-level dependency. Keeping Numba also exposes a second failure mode on RTX 5090: the old Numba CUDA kernel can fail at compile time with CUDA_ERROR_UNSUPPORTED_PTX_VERSION / unsupported PTX version.

This removes the dependency instead of constraining global NumPy resolution.

Verification

  • Focused tests pass in the develop venv: python -m pytest source/isaaclab_tasks/test/contrib/test_automate_soft_dtw.py -q (5 passed).
  • Focused tests pass in the beta2 venv where Numba import is broken (5 passed).
  • git diff --check passes.
  • py_compile passes for the touched Python files.
  • Old-vs-new SoftDTW CPU forward parity: 594 finite cases across gamma={0.01,0.1,1.0}, normalized/non-normalized valid cases, bandwidth {None,2,20}, and sequence lengths up to B=8,N=10,M=100; max absolute difference was 1.526e-04.
  • Mustafa's row/column Torch DP variant matched the current no-grad implementation exactly in direct forward checks; for B=128,N=10,M=100, it measured 82.497 ms on CUDA versus 14.038 ms for the anti-diagonal no-grad SoftDTW path, so this PR keeps the anti-diagonal path for reward inference while using the cleaner Torch DP-table style for autograd.
  • For gamma=0, the old implementation returns nan on a simple hard-DTW case; the new implementation returns the expected hard-DTW value 1.0.
  • New SoftDTW autograd smoke test produces finite gradients.
  • AutoMate reward parity: optimized length-aware reward path matches the original per-env reward loop on synthetic AutoMate-shaped data (128 envs, 10 robot waypoints, ref_len=100, gamma=0.01); max absolute error was 0.0 on CPU and CUDA.

Performance

Synthetic AutoMate-shaped reward benchmark on RTX 5090 with Torch 2.10.0+cu128, 128 envs, 10 robot waypoints, ref_len=100, gamma=0.01, no_grad:

| Path | CPU median | CUDA median | CUDA peak allocated delta |
| --- | --- | --- | --- |
| Per-env Torch reward loop | 141.617 ms | 355.131 ms | 15.372 MB |
| Batched length-aware Torch reward | 13.483 ms | 25.824 ms | 15.372 MB |

The peak CUDA allocation in this reward benchmark is dominated by the closest-state torch.cdist calculation, not by the SoftDTW table.

The previous Numba CUDA path could not be timed on this RTX 5090 because it fails locally with CUDA_ERROR_UNSUPPORTED_PTX_VERSION; the performance comparison above is against the direct per-env Torch replacement path that this PR would otherwise have used.
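
For reference, the speedups implied by the median timings in the table above can be recomputed directly. This is plain arithmetic on the reported numbers, not a new measurement; the dictionary layout and the `speedup` helper are only illustrative.

```python
# Median timings in milliseconds, copied from the benchmark table.
timings_ms = {
    "cpu": {"per_env": 141.617, "batched": 13.483},
    "cuda": {"per_env": 355.131, "batched": 25.824},
}


def speedup(per_env_ms, batched_ms):
    """Ratio of the per-env loop time to the batched time."""
    return per_env_ms / batched_ms


speedups = {dev: speedup(t["per_env"], t["batched"]) for dev, t in timings_ms.items()}
for dev, ratio in speedups.items():
    print(f"{dev}: {ratio:.1f}x")
# cpu: 10.5x
# cuda: 13.8x
```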

@github-actions github-actions Bot added the bug, isaac-lab, and infrastructure labels Jun 8, 2026
greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR constrains the NumPy and Numba dependency bounds to prevent incompatible pre-release stacks (numba==0.66.0rc1 + numpy==2.5.0rc1) from being resolved under prerelease = "allow", which was breaking soft_dtw_cuda.py CUDA imports in AutoMate training.

  • The root pyproject.toml uv workspace override is tightened from numpy>=2 to numpy>=2,<2.5, blocking NumPy 2.5.x project-wide in the workspace.
  • source/isaaclab_tasks/pyproject.toml pins Numba to <0.66, keeping the resolved stack at numba==0.65.1 + numpy==2.4.6.
  • A changelog fragment is added under isaaclab_tasks/changelog.d/ in the correct RST format.

Confidence Score: 4/5

The workspace fix is sound, but the numpy upper bound lives only in the uv workspace override and not in the package's own metadata, leaving standalone installs unprotected.

The workspace-level numpy cap and the numba <0.66 bound together correctly block the broken pre-release stack for anyone using the workspace. The gap is that isaaclab_tasks/pyproject.toml still advertises numpy>=2 without an upper bound, so a direct non-workspace install can still resolve the incompatible combination. The changelog fragment and overall approach are correct.

source/isaaclab_tasks/pyproject.toml — the numpy>=2 dependency line is missing the <2.5 upper bound that was added to the workspace override.

Important Files Changed

| Filename | Overview |
| --- | --- |
| pyproject.toml | Root uv workspace override tightened from numpy>=2 to numpy>=2,<2.5 to block pre-release NumPy 2.5 that Numba 0.65 rejects; effective for workspace installs only. |
| source/isaaclab_tasks/pyproject.toml | Numba capped at <0.66 to pin to the 0.65 line, but numpy>=2 is still unbounded, leaving standalone installs unprotected from the NumPy 2.5 incompatibility. |
| source/isaaclab_tasks/changelog.d/fix-automate-numba-constraints.rst | New changelog fragment added in the correct changelog.d/ directory with the correct RST section structure matching existing entries. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["uv resolve\n(prerelease=allow)"] --> B{Root override-dependencies}
    B -->|"Before: numpy>=2"| C["numpy==2.5.0rc1\nnumba==0.66.0rc1\nllvmlite==0.48.0rc1"]
    B -->|"After: numpy>=2,<2.5"| D["numpy==2.4.6\nnumba==0.65.1\nnumba-cuda==0.30.2"]
    C --> E["❌ import numba.cuda FAILS\n(Numba rejects NumPy 2.5 RC)"]
    D --> F["✅ import numba.cuda OK\nsoft_dtw_cuda.py works"]
    G["isaaclab_tasks/pyproject.toml\nnumba>=0.63.1,<0.66"] --> D
    H["⚠️ numpy>=2 (no upper bound)\nin isaaclab_tasks pyproject"] -.->|standalone install| C

Comments Outside Diff (1)

  1. source/isaaclab_tasks/pyproject.toml, line 20

    P1 The numpy>=2 lower-bound-only constraint in the package's own pyproject.toml is not paired with the <2.5 upper bound that was added to the root uv workspace override. A user who installs isaaclab_tasks directly (e.g. pip install isaaclab-tasks or uv add isaaclab-tasks outside this workspace) will still be able to resolve numba==0.65.x together with numpy==2.5.x, recreating exactly the broken stack this PR is fixing. The numpy cap should live in the package metadata, not only in the workspace [tool.uv] override.

Reviews (1): Last reviewed commit: "Constrain AutoMate Numba dependency stac..."

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot

PR #6040 — Remove AutoMate Numba SoftDTW dependency

Summary

This PR adds upper-bound constraints to prevent uv from resolving pre-release Numba/NumPy stacks that break numba.cuda imports (used by AutoMate SoftDTW).

Changes Reviewed

| File | Change |
| --- | --- |
| pyproject.toml | Root uv override: numpy>=2 → numpy>=2,<2.5 |
| source/isaaclab_tasks/pyproject.toml | numba>=0.63.1 → numba>=0.63.1,<0.66 |
| changelog.d/...rst | New changelog fragment |

Findings

  1. Root cause is well-explained — The prerelease = "allow" setting combined with the unbounded numpy>=2 override was pulling in RC versions of numba, numpy, and llvmlite that are incompatible.

  2. Constraint approach is correct — Pinning numpy<2.5 at the override level and numba<0.66 at the package level keeps the resolver on the stable 0.65.x line while still allowing future patch releases.

  3. ⚠️ Minor: Upper bounds will need maintenance — When Numba 0.66 and NumPy 2.5 are officially released (non-RC), these caps should be revisited. Consider adding a comment in pyproject.toml explaining why the cap exists (e.g., # Cap to avoid pre-release stacks; bump when numba 0.66 stable is verified).

  4. Changelog fragment present — Follows project conventions (changelog.d/ with .rst format, proper Fixed category).

  5. CI checks passing — Pre-commit, license-check, wheel build, changelog check, and labeler all pass. Docker/installation tests still pending but unrelated to dependency metadata.

Verdict

👍 LGTM — Clean, minimal fix with good root-cause analysis. The constraints are appropriately scoped and the resolver dry-run verification is convincing. Only suggestion is to add inline comments for future maintainers explaining the upper bounds.


Automated review by isaaclab-review-bot • SHA: 8d47fb7


📝 Update (d0e0cac5) — Approach Changed: Numba Removed Entirely

The PR direction has shifted significantly. Instead of constraining the Numba/NumPy version stack, this commit removes the Numba dependency entirely by rewriting the SoftDTW implementation in pure PyTorch.

New Changes

| File | Change |
| --- | --- |
| soft_dtw_cuda.py | Replaced ~350 lines of Numba CUDA/CPU kernels + torch.autograd.Function with a ~30-line pure-Torch _soft_dtw() function using torch.logsumexp |
| source/isaaclab_tasks/pyproject.toml | Removed numba>=0.63.1,<0.66 dependency entirely |
| pyproject.toml | Reverted numpy override back to numpy>=2 (upper bound no longer needed) |
| run_w_id.py | Removed NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS=0 env var and os import |
| changelog.d/...rst | Updated to reflect the new approach |
| test/contrib/test_automate_soft_dtw.py | New: unit tests verifying the Torch implementation works without Numba |

Findings on New Approach

  1. Root cause eliminated — Rather than constraining a fragile dependency chain, Numba is gone. No more numba/llvmlite/numpy version compatibility issues. This is a stronger fix.

  2. API preserved: the SoftDTW class retains the same constructor signature (use_cuda, device, gamma, normalize, bandwidth, dist_func). Existing callers will not break.

  3. ⚠️ Performance consideration — The new implementation uses Python-level loops over the DTW matrix (for i in range(1, len_x + 1): for j in range(1, len_y + 1):). For short sequences this is fine, but for long sequences this will be significantly slower than the previous Numba CUDA kernel. Worth noting in the changelog or docstring if large sequences are expected.

  4. ⚠️ No autograd backward — The previous implementation had explicit backward() via torch.autograd.Function. The new _soft_dtw() relies on PyTorch's autograd graph through the standard ops (logsumexp, stack, tensor arithmetic). This should work correctly but may use more memory for long sequences due to the computational graph.

  5. Tests added — Three test cases covering: no-Numba requirement, hard DTW correctness (gamma=0), and normalized mode (identical sequences → 0). Good coverage for a replacement.

  6. gamma=0 handled — Falls back to torch.minimum for hard DTW, avoiding division by zero.
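
The soft minimum mentioned in findings 4 and 6 can be sketched in plain Python as follows. This is an illustrative scalar version under the usual SoftDTW definition, softmin = -gamma * log(sum(exp(-v / gamma))), with the hard-minimum fallback for gamma == 0; the name `soft_min` is hypothetical, and the PR itself is described as using torch.logsumexp and torch.minimum on tensors.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum of `values`; falls back to the hard minimum if gamma == 0."""
    lo = min(values)
    if gamma == 0 or math.isinf(lo):
        # gamma == 0 would divide by zero; an all-inf input has no finite minimum.
        return lo
    # Shift by the minimum so the largest term is exp(0) = 1 and nothing overflows.
    total = sum(math.exp(-(v - lo) / gamma) for v in values)
    return lo - gamma * math.log(total)


print(soft_min((1.0, 2.0, 3.0), 0))               # 1.0 (hard minimum)
print(round(soft_min((1.0, 2.0, 3.0), 1.0), 4))   # 0.5924
print(soft_min((float("inf"), float("inf"), 2.0), 0.5))  # 2.0
```

The soft minimum is never larger than the hard minimum and approaches it as gamma shrinks, which is why gamma = 0 can be treated as the hard-DTW limit.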

Updated Verdict

👍 LGTM — This is a better approach than version pinning. Eliminating Numba removes the fragile dependency chain entirely and simplifies the build. The pure-Torch implementation is clean, well-tested, and API-compatible. The only trade-off is potential performance regression for very long sequences, which is acceptable given AutoMate's typical use case.

Previous concern about "upper bounds needing maintenance" is now resolved — there are no upper bounds to maintain.


Updated review by isaaclab-review-bot • SHA: d0e0cac5


📝 Update (1b0e2424) — Performance Optimizations for Batched SoftDTW

This commit addresses the performance concerns raised in the previous review by adding optimized inference paths and batched variable-length SoftDTW computation.

New Changes

File Change
soft_dtw_cuda.py Added _soft_dtw_no_grad() (anti-diagonal batched DP), _soft_dtw_variable_y_no_grad() (variable-length Y), and forward_with_lengths() method
automate_algo_utils.py Refactored reward computation to batch SoftDTW calls instead of per-env loop
test_automate_soft_dtw.py Added test for forward_with_lengths() correctness

Findings on Performance Optimizations

  1. Anti-diagonal DP path: _soft_dtw_no_grad() evaluates each anti-diagonal of the DP matrix in one batched operation, avoiding Python row/column loops. This is a significant optimization for inference without gradients.

  2. Smart dispatch — The new _soft_dtw() wrapper checks torch.is_grad_enabled() and requires_grad to choose between the autograd-preserving path and the faster no-grad path. Clean separation of concerns.

  3. Variable-length batching: _soft_dtw_variable_y_no_grad() handles padded Y sequences with different lengths by masking invalid positions with inf. This enables batching across environments with different reference trajectory lengths.

  4. Reward function batching: get_imitation_reward_from_dtw() now collects selected trajectories, pads them to max length, and calls forward_with_lengths() in one batched operation (when available and normalize=False). Falls back to grouping by length otherwise.

  5. Test coverage: test_soft_dtw_forward_with_lengths_matches_unpadded_calls() verifies the batched path produces identical results to per-element calls. Good regression test.

  6. ⚠️ Minor (type annotations): selected_trajs_by_len and env_ids_by_len use the generic dict[int, list[...]] syntax, which requires Python 3.9+. IsaacLab targets 3.10+ so this is fine, but worth noting.

  7. Explicit int() casts — Tensor indexing now uses int(min_dist_traj_idx[i].item()) etc., avoiding potential type warnings with newer PyTorch versions.
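
To make finding 1 concrete: cells with the same i + j lie on one anti-diagonal and read only the two previous anti-diagonals, so they do not depend on each other. The plain-Python sketch below walks the table in that order. It uses hard DTW for brevity and the hypothetical name `dtw_antidiagonal`; in a batched Torch version, the inner loop is the part that would become a single vectorized update, though the PR's actual _soft_dtw_no_grad() may be organized differently.

```python
INF = float("inf")


def dtw_antidiagonal(D):
    """Hard DTW over cost matrix D, visiting the DP table one anti-diagonal at a time."""
    n, m = len(D), len(D[0])
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    # 1-based cells (i, j) with i + j == k form anti-diagonal k.
    for k in range(2, n + m + 1):
        i_lo = max(1, k - m)   # keeps j = k - i <= m
        i_hi = min(n, k - 1)   # keeps j = k - i >= 1
        # Every cell below reads only diagonals k - 1 and k - 2,
        # so these iterations are independent of each other.
        for i in range(i_lo, i_hi + 1):
            j = k - i
            R[i][j] = D[i - 1][j - 1] + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R[n][m]


print(dtw_antidiagonal([[1.0, 2.0], [2.0, 1.0]]))            # 2.0
print(dtw_antidiagonal([[1.0, 2.0, 3.0], [2.0, 1.0, 2.0]]))  # 4.0
```

The traversal runs n + m - 1 outer steps instead of n * m scalar steps, which is the reason this ordering suits GPU execution.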

Previous Concerns Addressed

  • Performance concern ✅: The no-grad anti-diagonal path addresses the Python loop overhead for inference. Per the benchmark table in the PR description, the batched path is about 10.5x faster on CPU (141.617 ms vs 13.483 ms) and about 13.8x faster on CUDA (355.131 ms vs 25.824 ms) than the per-env loop.

  • Memory concern ✅ — The no-grad path uses a single R tensor instead of building an autograd graph, reducing memory pressure during inference.

Updated Verdict

👍 LGTM — This commit delivers on the performance optimization promise. The anti-diagonal batched DP is a well-known technique for GPU-friendly DTW, and the variable-length batching is a clean solution for AutoMate's per-env trajectory selection. All concerns from previous review have been addressed.


Updated review by isaaclab-review-bot • SHA: 1b0e2424


📝 Update (02fe40b7) — Autograd Path Rewritten with Proper Tensor Storage + Gradient Test

Changes in This Commit

| File | Change |
| --- | --- |
| soft_dtw_cuda.py | Rewrote _soft_dtw_autograd() to use a 3D R tensor instead of Python lists; added bandwidth banding to autograd path; style fixes (if 0 < bandwidth → if bandwidth > 0); added proper docstring to _soft_dtw() |
| test_automate_soft_dtw.py | Added test_soft_dtw_backward_produces_finite_gradients() test; cosmetic path simplification |

Findings

  1. Autograd path now uses proper tensor indexing — The previous prev_row/curr_row list-of-tensors approach is replaced with a single R tensor of shape (batch, len_x+2, len_y+2). This is cleaner and avoids potential issues with Python list references during backprop graph construction.

  2. Bandwidth banding added to autograd path — The autograd path now pre-computes j_start/j_end based on band_size instead of using continue inside the inner loop. This skips unnecessary iterations entirely, matching the optimization already present in the no-grad path.

  3. Gradient test added: test_soft_dtw_backward_produces_finite_gradients() verifies that backward() through the autograd path produces non-None, finite gradients. This directly validates the concern raised in the earlier review about autograd correctness.

  4. Style consistency: if 0 < bandwidth: → if bandwidth > 0: is a minor readability improvement applied consistently across both no-grad functions.

  5. Docstring added: _soft_dtw() now has proper argument documentation explaining the gamma and bandwidth parameters.
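
The j_start/j_end banding described in finding 2 amounts to computing, for each row, the column range that falls inside the Sakoe-Chiba band. A small sketch, assuming 1-based DP indices and that a non-positive bandwidth means "no band"; the helper name `band_columns` is hypothetical:

```python
def band_columns(i, len_x, len_y, bandwidth):
    """Return the 1-based columns visited in row i of the DP table."""
    # A non-positive bandwidth disables the band: widen it to cover the whole table.
    band = int(bandwidth) if bandwidth > 0 else max(len_x, len_y)
    j_start = max(1, i - band)
    j_end = min(len_y, i + band) + 1  # exclusive upper bound
    return range(j_start, j_end)


print(list(band_columns(5, 10, 10, 2)))   # [3, 4, 5, 6, 7]
print(list(band_columns(3, 4, 6, -1)))    # [1, 2, 3, 4, 5, 6]
print(list(band_columns(10, 10, 3, 2)))   # [] (row lies entirely outside the band)
```

Computing the range up front means out-of-band cells are never visited, rather than being visited and skipped with continue.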

Previous Concerns Addressed

  • "No autograd backward" concern ✅ — Now explicitly tested with gradient finiteness assertion.
  • Memory for autograd path — The R tensor approach pre-allocates the full DP table, which uses slightly more memory than the row-by-row list approach, but provides cleaner autograd graph construction. Acceptable trade-off for correctness.

Updated Verdict

👍 LGTM — Solid incremental improvement. The autograd path is now more robust with proper tensor storage, bandwidth optimization, and an explicit gradient test. All previously raised concerns have been addressed across the PR's evolution.


Updated review by isaaclab-review-bot • SHA: 02fe40b7

@ooctipus ooctipus changed the title from "Constrain AutoMate Numba dependency stack" to "Remove AutoMate Numba SoftDTW dependency" Jun 9, 2026
return module


def test_soft_dtw_use_cuda_does_not_require_numba():

Contributor

It seems this test and the one below could be combined into a single test parameterized over ('cpu', 'cuda').

)
E = E[:, 1 : N + 1, 1 : M + 1]
return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None
def _soft_dtw(D: torch.Tensor, gamma: float, bandwidth: float) -> torch.Tensor:

Contributor

How about the version below (Torch tensors only instead of Python lists, with a better docstring)?

import torch


def _soft_dtw(D: torch.Tensor, gamma: float, bandwidth: float = -1) -> torch.Tensor:
    """
    Compute batched SoftDTW from a pairwise distance tensor.

    D: Tensor of shape (batch, len_x, len_y)
    gamma: SoftDTW smoothing parameter. If gamma == 0, computes hard DTW.
    bandwidth: Optional Sakoe-Chiba bandwidth. If <= 0, no band constraint.
    """
    batch_size, len_x, len_y = D.shape

    inf = torch.tensor(float("inf"), device=D.device, dtype=D.dtype)

    R = torch.full(
        (batch_size, len_x + 2, len_y + 2),
        inf,
        device=D.device,
        dtype=D.dtype,
    )

    R[:, 0, 0] = 0

    use_band = bandwidth > 0
    bandwidth = int(bandwidth) if use_band else max(len_x, len_y)

    for i in range(1, len_x + 1):
        j_start = max(1, i - bandwidth)
        j_end = min(len_y, i + bandwidth) + 1

        for j in range(j_start, j_end):
            r0 = R[:, i - 1, j - 1]
            r1 = R[:, i - 1, j]
            r2 = R[:, i, j - 1]

            if gamma == 0:
                softmin = torch.minimum(torch.minimum(r0, r1), r2)
            else:
                softmin = -gamma * torch.logsumexp(
                    torch.stack((-r0 / gamma, -r1 / gamma, -r2 / gamma), dim=0),
                    dim=0,
                )

            R[:, i, j] = D[:, i - 1, j - 1] + softmin

    return R[:, len_x, len_y]

@ooctipus ooctipus merged commit dc51341 into isaac-sim:develop Jun 9, 2026
37 checks passed
@ooctipus ooctipus deleted the fix/automate-numba-constraints branch June 9, 2026 03:20
AntoineRichard pushed a commit that referenced this pull request Jun 9, 2026
Backports #6040 to release/3.0.0-beta2.

Beta2-specific conflict resolution:
- kept beta2's existing pyproject.toml packaging shape
- removed numba from source/isaaclab_tasks/setup.py, where beta2 still declares install_requires
- placed the new SoftDTW test under source/isaaclab_tasks/test and updated it to load isaaclab_tasks/direct/automate/soft_dtw_cuda.py

Validation:
- git diff --check refs/remotes/upstream/release/3.0.0-beta2...HEAD
- python3 -m py_compile source/isaaclab_tasks/isaaclab_tasks/direct/automate/automate_algo_utils.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/run_w_id.py source/isaaclab_tasks/isaaclab_tasks/direct/automate/soft_dtw_cuda.py source/isaaclab_tasks/test/test_automate_soft_dtw.py source/isaaclab_tasks/setup.py
- python3 -m pytest source/isaaclab_tasks/test/test_automate_soft_dtw.py -q could not run locally because the active system Python does not have torch installed
