
feat(runtime): file-lock the TRT-RTX runtime cache #4237

Draft
tp5uiuc wants to merge 3 commits into pytorch:main from tp5uiuc:feat/runtime-cache-file-lock

Conversation

@tp5uiuc (Contributor) commented May 6, 2026

Summary

Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp}) wired into load_runtime_cache (shared) and save_runtime_cache_impl (exclusive) so the Python and C++ TRT-RTX runtimes sharing a runtime_cache_path do not race the rename and silently drop compiled kernels. Follow-up to a reviewer ask on #4202 to land locking as a separate pass.

Backend

Matches the filelock Python library so the two runtimes interoperate on the same <cache>.lock file:

  • Linux/macOS: flock(2) — not POSIX fcntl(F_SETLK), which lives in an independent lock namespace and would silently fail to interoperate on Linux.
  • Windows: LockFileEx on byte range (0, 1) — matches the byte range msvcrt.locking(..., 1) locks on the Python side.

flock(2) has no native timeout, so try_lock_for is a 50ms-cadence poll loop with a 10s default matching filelock's acquire(timeout=10). Errors propagate via TORCHTRT_CHECK; the existing try/catch in ensure_initialized and the noexcept save_runtime_cache wrapper preserve behavior on contention.
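The try_lock_for semantics above can be sketched in Python with the stdlib `fcntl` module (Linux/macOS only). This is illustrative, not the PR's C++ code — the function name and defaults just mirror the description:

```python
import fcntl
import os
import time

def try_lock_for(path, exclusive=True, timeout=10.0, poll=0.05):
    """Illustrative poll loop: retry a non-blocking flock(2) every `poll`
    seconds until `timeout` elapses (defaults mirror the PR's 10s acquire
    and 50ms cadence). Returns the locked fd on success."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    op = (fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH) | fcntl.LOCK_NB
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, op)
            return fd  # caller releases via flock(fd, LOCK_UN) + close
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"no lock on {path} within {timeout}s")
            time.sleep(poll)
```

Because flock(2) has no blocking-with-deadline mode, polling a non-blocking attempt is the standard workaround; the 50ms cadence bounds added latency while keeping contention cheap.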

Tests

  • C++ (tests/cpp/test_file_lock.cpp): 12 gtest cases — exclusive/shared/mixed contention, timeout edges, RAII release, move semantics, no-unlink-on-release, and a same-namespace `flock(2)` interop check.
  • Python (tests/py/dynamo/runtime/test_000_runtime_cache.py): parameterizes `test_filelock_works` and `test_sequential_save_load` over both runtimes; adds `test_python_lock_blocks_cpp_save` (Python `filelock` blocks C++ save → timeout, cache unchanged, post-release save succeeds) and `test_filelock_cross_runtime_parallel` (two subprocesses, one per runtime, on a shared `cache_path`).

Local (A100, RTX): `bazel test //tests/cpp:test_file_lock` → 12/12 pass; full `test_000_runtime_cache.py` → all pass + RTX-gated skips.


Test plan

  • C++ unit tests pass (`bazel test //tests/cpp:test_file_lock`)
  • Python e2e pass (`pytest tests/py/dynamo/runtime/test_000_runtime_cache.py`)
  • Pre-commit clean (clang-format, black, isort, ruff, buildifier, typos)

tp5uiuc added 3 commits May 6, 2026 02:01
…y, and native CUDA graph support to C++ runtime

- Introduce IRuntimeConfig scaffolding and bump ABI to v9
- Add runtime cache to C++ runtime for TensorRT-RTX
- Add dynamic shapes kernel specialization strategy to C++ runtime
- Add TensorRT-RTX native CUDA graph strategy to C++ runtime
- Extract TRTRuntimeConfig
- Consolidate C++ runtime tests and add model-level coverage
…xecution_context

release_nccl_comm() previously rebuilt the IExecutionContext via direct
calls to ICudaEngine::createExecutionContext, bypassing the TRTRuntimeConfig
plumbing introduced earlier in this PR. On that path the RTX runtime cache
was not flushed before context teardown, and the dynamic shapes kernel
specialization and CUDA graph strategies stored on TRTRuntimeConfig were
not re-applied to the new context.

Delegate to recreate_execution_context() instead. It saves the runtime
cache, ensures TRTRuntimeConfig is initialized, sets the allocation
strategy from resource_allocation_strategy, and creates the new exec
context via createExecutionContext(runtime_cfg.config.get()), keeping all
strategies live across the NCCL bind/release cycle.
…safety

Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp})
matching py-filelock's lock-file convention so the Python and C++ runtimes
sharing a runtime_cache_path do not race the rename and silently drop
compiled kernels.

- Unix backend uses BSD flock(2) -- the primitive py-filelock uses, not
  POSIX fcntl record locks (which live in an independent namespace and
  would silently fail to interop on Linux).
- Windows backend uses LockFileEx on byte (0,1) -- matches the byte range
  msvcrt.locking(..., 1) locks on the Python side.
- Platform branch is hidden behind a LockHandle struct with move-and-swap
  semantics, so callers only see a single FileLock RAII type.
- Shared/exclusive modes: load takes shared (multiple readers OK), save
  takes exclusive. Python's FileLock is exclusive-only but conflicts
  correctly against C++ shared holders since both use the flock namespace.
- 10s acquire timeout via 50ms-cadence poll loop, matching the Python
  side's timeout=10. Lock-file path is <cache_path>.lock.

Wired into load_runtime_cache and save_runtime_cache_impl, with the
FileLock scoped to just the I/O block (save writes in-place under the
lock, no tmp+rename). Errors propagate via TORCHTRT_CHECK; the existing
try/catch in ensure_initialized and the noexcept save_runtime_cache
wrapper catch and log, so external behavior on contention is unchanged.
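A Python analogue of that wiring (the real code is C++ with TORCHTRT_CHECK; the names save_runtime_cache_impl and load_runtime_cache come from the PR, while the file_lock helper here is a hypothetical stand-in for the FileLock RAII type):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def file_lock(path, exclusive):
    # Hypothetical analogue of the C++ FileLock: the lock file is
    # <cache_path>.lock and is NOT unlinked on release.
    fd = os.open(path + ".lock", os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
    try:
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

def save_runtime_cache_impl(cache_path, blob):
    with file_lock(cache_path, exclusive=True):   # writer excludes everyone
        with open(cache_path, "wb") as f:         # in-place write, no tmp+rename
            f.write(blob)

def load_runtime_cache(cache_path):
    with file_lock(cache_path, exclusive=False):  # readers share
        with open(cache_path, "rb") as f:
            return f.read()
```

Scoping the lock to just the I/O block keeps the critical section minimal, and writing in place under an exclusive lock removes the need for the tmp+rename dance that the unlocked path raced on.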

Tests:
- tests/cpp/test_file_lock.cpp: 12 unit tests covering exclusive/shared
  contention, timeout edges, RAII release, move semantics, no-unlink-on-
  release, and a same-namespace flock(2) interop check that verifies the
  C++ primitive conflicts with raw flock locks (what py-filelock uses).
- tests/py/dynamo/runtime/test_000_runtime_cache.py:
  - parameterizes test_filelock_works and test_sequential_save_load over
    both runtimes
  - test_python_lock_blocks_cpp_save: an externally-held py-filelock causes
    the C++ save to time out silently, leaving the cache file unmodified;
    a fresh save after release succeeds
  - test_filelock_cross_runtime_parallel: two subprocesses (one Python-
    runtime, one C++-runtime) compile against a shared cache_path and both
    succeed. Subprocesses rather than threads because torch.export has
    thread-unsafe TLS, but cross-process is the real-world locking
    scenario anyway.
@meta-cla meta-cla Bot added the cla signed label May 6, 2026
@github-actions github-actions Bot added labels May 6, 2026: component: tests (Issues re: Tests), component: core (Issues re: The core compiler), component: api [Python] (Issues re: Python API), component: runtime, component: dynamo (Issues relating to the `torch.compile` or `torch._dynamo.export` paths)
@github-actions github-actions Bot requested a review from narendasan May 6, 2026 10:53
@narendasan narendasan requested review from apbose and zewenli98 May 6, 2026 16:25
Comment thread core/runtime/runtime.h
REQUIRES_OUTPUT_ALLOCATOR_IDX,
RESOURCE_ALLOCATION_STRATEGY_IDX,
REQUIRES_NATIVE_MULTIDEVICE_IDX,
#ifdef TRT_MAJOR_RTX
Collaborator:
I would rather not ifdef the serialization format because users might accidentally cross packages here. We can make optional slots with a sentinel value. But TRT produced programs and TRT-RTX programs should share the format

Collaborator:
Also are these properties of the engine or are they runtime mode configurations? The point of this interface is the bare minimum information to reconstruct the program from disk


TRTEngine::TRTEngine(
-    const std::string& serialized_engine,
+    std::string serialized_engine,
Collaborator:
why does this need to be a deep copy?

std::tuple("resource_allocation_strategy", serialized_info[RESOURCE_ALLOCATION_STRATEGY_IDX]),
-    std::tuple("requires_native_multidevice", serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX]));
+    std::tuple("requires_native_multidevice", serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX])
#ifdef TRT_MAJOR_RTX
Collaborator:
See above comment

this->resource_allocation_strategy == ResourceAllocationStrategy::kDynamic ? "1" : "0";
serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX] = this->requires_native_multidevice ? "1" : "0";
// rank/world_size are runtime facts (may differ at load time); not serialized.
#ifdef TRT_MAJOR_RTX
Collaborator:
Same here

torch.ops.tensorrt.SERIALIZATION_LEN()
) # 15 (RTX) / 12 (standard)

_DYNAMIC_SHAPES_KERNEL_STRATEGY_MAP: Dict[str, int] = {
Collaborator:
Are these things that need user apis?

Comment on lines +79 to +82
_CUDA_GRAPH_STRATEGY_MAP: Dict[str, int] = {
"disabled": 0,
"whole_graph_capture": 1,
}
Collaborator:
See above?

autocast_calibration_dataloader (Optional[torch.utils.data.DataLoader]): The dataloader to use for autocast calibration. Default is None.
offload_module_to_cpu (bool): Offload the model to CPU to reduce memory footprint during compilation
dynamically_allocate_resources (bool): Dynamically allocate resources for TensorRT engines
cuda_graph_strategy (str): TensorRT-RTX CUDA graph strategy: "disabled" (default) or "whole_graph_capture" (let TensorRT-RTX manage CUDA graph capture/replay internally). When set and combined with `torch_tensorrt.runtime.set_cudagraphs_mode(True)` on RTX, overrides manual capture. Not used for standard TensorRT.
Collaborator:
Runtime mode controls should be controlled via context managers rather than passed in at compile time. Only information that is fixed at runtime needs to be here

@@ -0,0 +1,187 @@
#include <atomic>
Collaborator:
move into //tests/core/runtime or //tests/core/util


Labels

cla signed, component: api [Python], component: core, component: dynamo, component: runtime, component: tests
