Opinionated cross-platform performance instrumentation for Rust and Python (see bindings/python) that brings CPU, GPU, and I/O metrics together in a single profiler. It does both halves of the job: measuring your code, and visualizing the system metrics, either live or after the fact from a recorded capture.
Annotate a function and the same scope goes to a puffin flame chart, a tracing span, and (with the cuda feature) an NVTX range. With the viz feature, vor also draws an egui panel with that flame chart, frame-rate bars, and live system and GPU metrics.
- Macros for functions, methods, and whole `impl` blocks: `#[profile]`, `#[all_functions]`, `#[skip]`. `const fn`s are left alone.
- The same annotations work on native macOS, web/wasm, and NVIDIA (NVTX).
- An egui panel with frame bars, a puffin flame chart, and one line plot per metric, with pin, pause, range-select, and zoom.
- System metrics sampled for you every frame: frame time, resident memory, and per-frame I/O.
- Headless capture: set `VOR_RECORD` and the same instrumentation streams system, GPU, and named metrics (plus opt-in flame frames) to a `.vor` file you replay later, with no panel in the binary.
- Python bindings: profile Python with `@vor.profile` and `record_metric`, capturing to the same `.vor` stream the Rust tools read.
- Live GPU metrics in the panel: Apple Silicon via IOKit and IOReport (no `sudo`), NVIDIA via NVML.
- Sinks that write a Chrome trace on native or push to the browser DevTools timeline on web.
vor is feature-gated, so pull in only what your platform needs.
[dependencies]
vor = { version = "0.2", features = ["viz", "mac"] }
Or cargo add vor --features viz,mac.
| feature | adds |
|---|---|
| (none) | instrumentation macros plus puffin/tracing scopes (no cost until enable()) |
| `viz` | the egui profiler panel (vor::viz) |
| `mac` | macOS: ChromeTraceSink, resident-memory sampling, and the IOKit/IOReport GPU collector |
| `web` | wasm: BrowserSink (DevTools User Timing), JS-heap memory, browser-safe puffin |
| `cuda` | NVIDIA: live GPU rows via NVML, plus an NVTX range per scope for Nsight Systems |
These features are independent; combine them as needed, for example ["viz", "mac", "cuda"].
// A single function or method.
#[vor::profile]
fn render(frame: u32) { /* ... */ }
// Every method in an impl. Scopes are named Renderer::sort,
// Renderer::shade, and so on, with no per-method attribute.
struct Renderer { /* ... */ }
#[vor::all_functions]
impl Renderer {
fn sort(&self) { /* ... */ }
fn shade(&self) { /* ... */ }
// Keep a hot trivial helper out of the flame chart.
#[vor::skip]
fn dirty(&self) -> bool { /* ... */ }
}
// An ad-hoc block scope.
fn step() {
vor::profile_scope!("expensive_part");
/* ... */
}
Turn collection on once, and mark a boundary per rendered frame:
fn main() {
vor::enable(); // switch puffin scope collection on
loop {
// ... your frame ...
vor::frame_mark(); // group scopes into this frame
}
}
Until enable() is called the puffin half does nothing. The tracing half is always live for whatever subscriber you install.
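To make the "does nothing until enable()" behaviour concrete, here is a minimal standard-library sketch of a gated scope guard. It is not vor's implementation (vor hands scopes to puffin and tracing); `ENABLED`, `LOG`, `set_enabled`, `recorded_count`, and `Scope` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hypothetical stand-ins for a profiler's global state (not vor's internals).
static ENABLED: AtomicBool = AtomicBool::new(false);
static LOG: Mutex<Vec<(&'static str, Duration)>> = Mutex::new(Vec::new());

fn set_enabled(on: bool) {
    ENABLED.store(on, Ordering::Relaxed);
}

fn recorded_count() -> usize {
    LOG.lock().unwrap().len()
}

/// RAII guard: starts timing on entry and reports when it is dropped.
struct Scope {
    name: &'static str,
    start: Option<Instant>, // None when collection was off at entry
}

impl Scope {
    fn enter(name: &'static str) -> Self {
        // While disabled, entering a scope costs one relaxed atomic load.
        let start = if ENABLED.load(Ordering::Relaxed) {
            Some(Instant::now())
        } else {
            None
        };
        Scope { name, start }
    }
}

impl Drop for Scope {
    fn drop(&mut self) {
        if let Some(start) = self.start {
            LOG.lock().unwrap().push((self.name, start.elapsed()));
        }
    }
}

fn main() {
    { let _s = Scope::enter("before_enable"); } // collection off: dropped silently
    set_enabled(true);
    { let _s = Scope::enter("after_enable"); } // collection on: recorded
    for (name, took) in LOG.lock().unwrap().iter() {
        println!("{name}: {took:?}");
    }
}
```

The point of the design is that the check happens at scope entry, so a disabled profiler never reads the clock or takes a lock.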
The panel is one consumer of vor's per-frame samples; a file stream is another. For a headless job (ML training or inference, a server, a batch tool) set VOR_RECORD and the same instrumentation writes each frame_mark to an append-only .vor capture. No panel, no egui, no render loop, and nothing extra in the binary when the variable is unset.
fn main() {
vor::enable(); // arms the recorder if VOR_RECORD is set
for step in 0..steps {
train_step(); // #[vor::profile] on the hot fns inside
vor::record_metric("loss", loss); // optional named scalars
vor::frame_mark(); // one record per step
}
vor::flush_recording(); // write the tail before exit
}
VOR_RECORD=/scratch/run.vor cargo run --release # capture
cargo run --release # no recording, instrumentation only
record_metric(name, value) is the headless-friendly, generics-free counterpart to a panel Metric<R>: the latest value per name is snapshotted into each frame's record. Good for loss, learning rate, tokens/sec, or batch size. To label a row with a unit, call record_metric_unit(name, unit) once (e.g. at startup); metrics stay unitless otherwise.
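The "latest value per name" semantics can be pictured as a small map that is read at every frame boundary. The sketch below is an illustration under assumptions, not vor's code: `NamedMetrics` and its methods are hypothetical, and it assumes a value persists across frames until it is overwritten.

```rust
use std::collections::BTreeMap;

/// Hypothetical store: the latest value per metric name, plus an optional unit.
#[derive(Default)]
struct NamedMetrics {
    latest: BTreeMap<String, f64>,
    units: BTreeMap<String, String>,
}

impl NamedMetrics {
    /// Overwrites any earlier value recorded under the same name.
    fn record(&mut self, name: &str, value: f64) {
        self.latest.insert(name.to_string(), value);
    }

    /// Registers a unit once; names without a unit stay unitless.
    fn set_unit(&mut self, name: &str, unit: &str) {
        self.units.insert(name.to_string(), unit.to_string());
    }

    /// Taken at each frame boundary: (name, unit, value), sorted by name.
    fn snapshot(&self) -> Vec<(String, String, f64)> {
        self.latest
            .iter()
            .map(|(name, value)| {
                let unit = self.units.get(name).cloned().unwrap_or_default();
                (name.clone(), unit, *value)
            })
            .collect()
    }
}

fn main() {
    let mut metrics = NamedMetrics::default();
    metrics.set_unit("tokens_per_sec", "tok/s");

    // Two writes to "loss" within one frame: only the last one is kept.
    metrics.record("loss", 0.9);
    metrics.record("loss", 0.7);
    metrics.record("tokens_per_sec", 1200.0);

    for (name, unit, value) in metrics.snapshot() {
        println!("{name} = {value} {unit}");
    }
}
```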
Every metric, system or user, is a column: a (name, unit) pair with a stable id, declared once before any value references it. System columns are declared in the header; a user metric is declared the first time it appears, taking the unit registered for it. Frame records then carry values by id, so a name or unit is never repeated per frame. Each record holds one system sample, the frame's user scalars, and (opt-in) one puffin flame frame. Records are length-delimited and compressed one at a time, so a reader can tail a growing file or stop cleanly at a truncated final record left by a crashed job:
[header] [u32 len][lz4 record] [u32 len][lz4 record] ...
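Length-delimited framing is what lets a reader stop at a torn tail and resume later. The sketch below shows the idea with the standard library only. It makes several assumptions that the layout above does not settle: the length prefix is taken to be little-endian, the header is skipped, and payloads are treated as opaque bytes rather than lz4-decompressed. `RecordBuffer` and `frame` are hypothetical names, not part of vor.

```rust
/// Buffers raw bytes and hands out complete `[u32 len][payload]` records.
#[derive(Default)]
struct RecordBuffer {
    buf: Vec<u8>,
}

impl RecordBuffer {
    /// Appends newly read bytes (for example, from a file that is still growing).
    fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Returns the next complete payload, or None if only a partial record
    /// is buffered. Partial bytes are kept so a later `feed` can complete them.
    fn next_record(&mut self) -> Option<Vec<u8>> {
        if self.buf.len() < 4 {
            return None;
        }
        // Assumption: little-endian length prefix.
        let len = u32::from_le_bytes([self.buf[0], self.buf[1], self.buf[2], self.buf[3]]) as usize;
        if self.buf.len() < 4 + len {
            return None; // torn trailing record
        }
        let payload = self.buf[4..4 + len].to_vec();
        self.buf.drain(..4 + len);
        Some(payload)
    }
}

/// Writer side: prefixes a payload with its length.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

fn main() {
    let mut stream = frame(b"one");
    stream.extend(frame(b"two!"));

    // Simulate a crash or an in-progress write: the last two bytes are missing.
    let (head, tail) = stream.split_at(stream.len() - 2);

    let mut reader = RecordBuffer::default();
    reader.feed(head);
    println!("{:?}", reader.next_record()); // first record is complete
    println!("{:?}", reader.next_record()); // second is torn, so None

    reader.feed(tail); // the writer caught up
    println!("{:?}", reader.next_record()); // second record is now complete
}
```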
The default is metrics-only (tens of bytes per step lz4'd), since that time series is what a long run actually wants. Flame frames are heavier and gated behind env vars:
| variable | effect |
|---|---|
| `VOR_RECORD` | output path (/scratch/run.vor); unset disables recording |
| `VOR_RECORD_FLAME=1` | also capture puffin flame frames (default off, metrics only) |
| `VOR_RECORD_EVERY=N` | capture a flame frame on 1 step in N |
| `VOR_RECORD_MAX_FRAMES=N` | stop capturing flame frames after N of them (metrics continue) |
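One way to picture how the three flame variables combine is a small sampling policy. This is a sketch under assumptions, not vor's logic: `FlamePolicy` is a hypothetical type, and which step out of every N gets captured (here, steps where `step % N == 0`) is a guess.

```rust
/// Hypothetical policy mirroring VOR_RECORD_FLAME, VOR_RECORD_EVERY,
/// and VOR_RECORD_MAX_FRAMES.
struct FlamePolicy {
    enabled: bool,           // VOR_RECORD_FLAME=1
    every: u64,              // VOR_RECORD_EVERY=N
    max_frames: Option<u64>, // VOR_RECORD_MAX_FRAMES=N, None for unlimited
    captured: u64,           // flame frames written so far
}

impl FlamePolicy {
    /// Decides whether this step should carry a flame frame.
    /// Metrics are written for every step regardless of the answer.
    fn should_capture(&mut self, step: u64) -> bool {
        if !self.enabled {
            return false;
        }
        if let Some(max) = self.max_frames {
            if self.captured >= max {
                return false; // budget spent; metrics continue
            }
        }
        // Assumption: capture the first step of each group of `every` steps.
        if step % self.every.max(1) != 0 {
            return false;
        }
        self.captured += 1;
        true
    }
}

fn main() {
    let mut policy = FlamePolicy { enabled: true, every: 10, max_frames: Some(3), captured: 0 };
    let flame_steps: Vec<u64> = (0..100).filter(|&step| policy.should_capture(step)).collect();
    println!("flame frames captured on steps {flame_steps:?}");
}
```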
Read a capture back with vor::Reader (header columns, then frames one at a time; stops at EOF or a torn trailing record, so the same code reads a finished or still-growing file):
let mut reader = vor::Reader::open("/scratch/run.vor").unwrap();
for column in reader.columns() { // system columns (name, unit)
println!("{} ({})", column.name, column.unit);
}
while let Some(frame) = reader.next_frame().unwrap() {
// frame.system aligns to reader.columns(); frame.user are (name, value) scalars,
// units in reader.user_columns(); frame.flame is a serialized puffin frame.
}
next_frame returns None at EOF or a partial trailing record and keeps the buffered bytes, so the same loop reads a finished file (stop at None) or one still being written (retry after None to pick up new frames as they land).
With the viz feature, vor::viz::ReplayState renders a capture through the same frame bars, flame chart, and metric rows as the live panel, fed from the stream instead of in-process sampling:
let mut state = vor::viz::ReplayState::open("/scratch/run.vor").unwrap();
// each egui frame:
state.show(ui);
With follow on (default) it tails a growing file, so you can watch a job live on the same host; off, or once the file stops growing, it is a post-mortem of the last few hundred frames. Click a bar to pause and inspect, shift-drag to zoom a frame range, and if the run captured flame frames the pinned step's flame chart fills in. examples/replay.rs wires this into a window. (Very long post-mortem runs that need scrolling past the bounded ring are future work.)
vor owns the system rows (frame_ms, memory_mb, io_ms, io_MB, and gpu_* where supported). You describe only your own per-frame workload.
use std::collections::VecDeque;
use vor::viz::{Metric, PanelConfig, PanelState, show};
#[derive(Clone, Copy)]
struct AppFrame { visible: u32 }
const fn visible_of(f: &AppFrame) -> f64 { f.visible as f64 }
const METRICS: &[Metric<AppFrame>] =
&[Metric::new("visible", visible_of, "splats").as_integer()];
let mut state = PanelState::new(PanelConfig::FRAME_MS);
let cap = PanelConfig::FRAME_MS.history_capacity;
let mut history: VecDeque<AppFrame> = VecDeque::with_capacity(cap);
// Once per displayed frame, inside your egui update. Skip the tick
// and the push while paused so every graph freezes together instead
// of scrolling under the pinned cursor:
if !state.is_paused() {
state.tick(); // sample system metrics, mark a puffin frame
if history.len() >= cap { history.pop_front(); }
history.push_back(AppFrame { visible: 1_500_000 });
}
show(ui, &mut state, &history, METRICS); // draw the panel
PanelState::tick() advances vor's own system ring. Push one workload record per tick so the two stay aligned, and gate both on is_paused() as above.
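Why the two pushes have to move together is easiest to see with two bounded rings side by side. The sketch below is a simplified model, not vor's `PanelState`: `Rings` and `push_frame` are hypothetical, and the system ring is reduced to a single `frame_ms` value per tick.

```rust
use std::collections::VecDeque;

/// Two bounded histories that share one index: entry i of `system`
/// and entry i of `workload` describe the same frame.
struct Rings {
    cap: usize,
    paused: bool,
    system: VecDeque<f64>,   // stand-in for the profiler's own samples
    workload: VecDeque<u32>, // stand-in for the caller's per-frame record
}

impl Rings {
    fn new(cap: usize) -> Self {
        Rings {
            cap,
            paused: false,
            system: VecDeque::with_capacity(cap),
            workload: VecDeque::with_capacity(cap),
        }
    }

    /// Pushes to both rings or to neither, so indices never drift apart.
    fn push_frame(&mut self, frame_ms: f64, visible: u32) {
        if self.paused {
            return; // freeze every graph together
        }
        if self.system.len() >= self.cap {
            self.system.pop_front();
            self.workload.pop_front();
        }
        self.system.push_back(frame_ms);
        self.workload.push_back(visible);
    }
}

fn main() {
    let mut rings = Rings::new(3);
    for i in 0..5u32 {
        rings.push_frame(f64::from(i), i * 100);
    }
    rings.paused = true;
    rings.push_frame(99.0, 9_900); // ignored while paused

    for (ms, visible) in rings.system.iter().zip(rings.workload.iter()) {
        println!("frame_ms={ms} visible={visible}");
    }
}
```

If only one of the two pushes were gated on the pause flag, the rings would end up with different lengths and a pinned cursor would point at mismatched frames.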
The bars and every metric plot share one time axis: a pin, a zoom range, and pause apply to all of them at once.
| action | effect |
|---|---|
| click a frame bar | pin the cursor on that frame (all graphs) and pause |
| shift-drag the bars | zoom every graph to that frame range (pins the slowest frame) |
| pause/resume button | freeze / follow the live stream (PanelState::toggle_pause) |
| scroll over the flame chart | zoom the flame chart's within-frame time; drag pans, double-click resets |
| profiler chip | annotate frame_ms with vor's own per-frame cost |
vor samples these itself on each tick():
| metric | source | platforms |
|---|---|---|
| `frame_ms` | wall time between ticks | all |
| `memory_mb` | RSS on mac, performance.memory on web (Chromium) | mac, web |
| `io_ms`, `io_MB` | your record_io(ns, bytes) calls, drained per frame | all |
| `gpu_util` | IOKit IOAccelerator on mac, NVML utilization on cuda | mac, cuda |
| `gpu_sm` | IOKit IOAccelerator renderer utilization | mac |
| `gpu_power` | IOReport GPU Energy on mac, NVML power draw on cuda | mac, cuda |
| `pcie` | NVML PCIe TX+RX | cuda |
| `gpu_mem` | IOKit IOAccelerator in-use memory on mac, NVML used on cuda | mac, cuda |
| `gpu_temp` | NVML core temperature | cuda |
| `gpu_clock` | NVML SM clock | cuda |
A background thread the panel starts polls the GPU backend (mac or cuda, no sudo) and the rows show only metrics that backend supplies: gpu_sm is macOS-only (NVML has no SM-occupancy counter), while pcie, gpu_temp, and gpu_clock are NVIDIA-only (the macOS backend doesn't read them). On a platform with no backend, including the browser (which gives a web page no GPU-telemetry API), the GPU rows are dropped rather than drawn as flat zeros.
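The io_ms and io_MB rows are described as being drained per frame. A lock-free accumulator with that shape can be sketched with two atomics, as below. This is an illustration, not vor's implementation: `IoAccumulator` is a hypothetical type, and the sketch assumes MB means 10^6 bytes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical accumulator: any thread adds, the frame loop drains.
struct IoAccumulator {
    ns: AtomicU64,
    bytes: AtomicU64,
}

impl IoAccumulator {
    const fn new() -> Self {
        Self { ns: AtomicU64::new(0), bytes: AtomicU64::new(0) }
    }

    /// Callable from any thread; never blocks.
    fn record(&self, elapsed_ns: u64, bytes: u64) {
        self.ns.fetch_add(elapsed_ns, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Returns (io_ms, io_mb) accumulated since the last drain and resets both.
    /// The two counters are swapped separately, so a concurrent `record`
    /// may be split across two frames; the totals still add up.
    fn drain(&self) -> (f64, f64) {
        let ns = self.ns.swap(0, Ordering::Relaxed);
        let bytes = self.bytes.swap(0, Ordering::Relaxed);
        (ns as f64 / 1e6, bytes as f64 / 1e6)
    }
}

fn main() {
    let acc = IoAccumulator::new();

    // Four background threads, each reporting 1000 small reads.
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1_000 {
                    acc.record(1_000, 512);
                }
            });
        }
    });

    let (io_ms, io_mb) = acc.drain(); // once per frame
    println!("io_ms={io_ms} io_MB={io_mb}");
}
```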
Feed I/O time from anywhere, including background threads:
vor::record_io(elapsed_ns, bytes); // lock-free accumulator
Install a sink once at startup, then drop the returned guard to flush.
// macOS. Open the output in chrome://tracing or Perfetto.
use vor::{ChromeTraceSink, Sink};
let guard = ChromeTraceSink { path: "trace.json".into() }.install();
// Web. Spans show up in the DevTools Performance tab.
use vor::{BrowserSink, Sink};
let guard = BrowserSink.install();
The cuda feature does two independent things on NVIDIA hardware:
- Fills the panel's GPU rows (`gpu_util`, `pcie`, `gpu_power`, and the other NVML-backed metrics) from NVML, the same way `mac` fills its GPU rows from IOKit and IOReport.
- Opens an NVTX range per scope, so your instrumented code lines up on an Nsight Systems timeline next to CUDA and GPU work. No code changes are needed: the same `#[profile]`, `#[all_functions]`, and `profile_scope!` carry over.
Neither needs a CUDA toolkit to build. nvtx vendors its headers and compiles them with cc; nvml-wrapper loads libnvidia-ml from the driver at runtime, so the GPU rows populate on any machine with an NVIDIA driver installed.
- `FrameStats`: an HDR histogram of per-frame nanoseconds, with `p50_ns`, `p95_ns`, `p99_ns`, and `mean_ns`.
- `calibrate()` and `empty_span_ns()`: measure the per-span instrumentation overhead so you can subtract it.
- `current_memory_bytes()`: process memory on supported platforms.
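For a sense of what those percentile fields report, the sketch below computes the same four numbers from a plain slice of frame times. It deliberately swaps the HDR histogram for an exact sort with nearest-rank percentiles, which is simpler but needs every sample in memory. `FrameSummary`, `percentile`, and `summarize` are hypothetical names, not vor's API.

```rust
/// Hypothetical summary with the same field names as the list above.
#[derive(Debug, PartialEq)]
struct FrameSummary {
    p50_ns: u64,
    p95_ns: u64,
    p99_ns: u64,
    mean_ns: f64,
}

/// Nearest-rank percentile over an ascending, non-empty slice.
/// rank = ceil(pct * n / 100), computed in integers to avoid float rounding.
fn percentile(sorted: &[u64], pct: usize) -> u64 {
    let n = sorted.len();
    let rank = (pct * n + 99) / 100;
    sorted[rank.clamp(1, n) - 1]
}

/// Returns None when there are no samples to summarize.
fn summarize(samples_ns: &[u64]) -> Option<FrameSummary> {
    if samples_ns.is_empty() {
        return None;
    }
    let mut sorted = samples_ns.to_vec();
    sorted.sort_unstable();
    let sum: u128 = sorted.iter().map(|&v| u128::from(v)).sum();
    Some(FrameSummary {
        p50_ns: percentile(&sorted, 50),
        p95_ns: percentile(&sorted, 95),
        p99_ns: percentile(&sorted, 99),
        mean_ns: sum as f64 / sorted.len() as f64,
    })
}

fn main() {
    // Mostly 16 ms frames with a few slow outliers.
    let mut frames: Vec<u64> = vec![16_000_000; 97];
    frames.extend([33_000_000, 50_000_000, 120_000_000]);
    println!("{:?}", summarize(&frames));
}
```

A histogram trades a small, bounded error for constant memory, which matters for long runs; the exact sort here is only meant to show what the numbers mean.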
examples/custom_metrics.rs is headless and shows the API shape (#[profile],
#[all_functions], caller-defined metrics, the PanelState loop):
cargo run --features viz --example custom_metrics
examples/headless.rs profiles an ML-style loop with no panel, records it when
VOR_RECORD is set, and reads the capture back with vor::Reader:
VOR_RECORD=/tmp/run.vor cargo run --example headless # capture
VOR_RECORD=/tmp/run.vor VOR_RECORD_FLAME=1 cargo run --example headless --features mac
cargo run --example headless -- /tmp/run.vor # summarize the capture
examples/replay.rs opens that capture in the panel, tailing it live or
replaying it after the fact:
cargo run --example replay --features viz,mac -- /tmp/run.vor
examples/live_panel.rs opens a window and renders the live panel, so it doubles
as an end-to-end check of each platform backend. Pick the feature set for the
machine you are on:
# macOS (Apple Silicon): live gpu_util / gpu_sm / gpu_power via IOKit + IOReport
cargo run --example live_panel --features viz,mac
# NVIDIA box: live gpu_util / pcie / gpu_power via NVML, plus NVTX ranges
cargo run --example live_panel --features viz,cuda
# Web / browser: the standalone demo in examples/web/ renders the panel in a canvas
cd examples/web && trunk serve --open # needs: cargo install trunk; rustup target add wasm32-unknown-unknown
(examples/web/ is a minimal eframe + trunk app; GPU rows are absent in the browser,
so it verifies the web build, the panel, and the DevTools timeline path.)
Run the GPU smoke tests directly (each asserts the backend returns sane readings; run on the matching machine):
cargo test --features viz,mac poll_yields_sane_readings # macOS
cargo test --features viz,cuda poll_yields_sane_readings # NVIDIA host
Dual-licensed under MIT or Apache-2.0.