Third stage of the Tensorbit Labs P-D-Q pipeline (Prune → Distill → Quant → Run).
Reads an FP32 .tbm model container from tensorbit-core, quantizes weight tensors
to INT4 or INT8, and writes a new .tbm with compressed weights + per-group scale
metadata. Zero external dependencies — C++20 standard library only.
```
┌─────────────────────────────────────────────────────────────────┐
│ Tensor 0 Blob                                                   │
│ ┌──────────┬───────────────────┬──────────┬───────┐             │
│ │ TBHeader │ Quantized weights │ Scales   │ Masks │             │
│ │ (4096 B) │ (INT4: N/2 bytes) │ (FP32)   │       │             │
│ └──────────┴───────────────────┴──────────┴───────┘             │
├─────────────────────────────────────────────────────────────────┤
│ ... more tensors ...                                            │
├─────────────────────────────────────────────────────────────────┤
│ JSON Index (UTF-8)                                              │
│ ┌──────────────────────────────────────────────────────────┐    │
│ │ {"name":"...","offset":...,"dtype":"int4",               │    │
│ │  "num_weights":...,"num_mask_bytes":...,                 │    │
│ │  "scale_count":8192,"group_size":128}                    │    │
│ └──────────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────────┤
│ 4-byte LE uint32 = JSON byte length                             │
└─────────────────────────────────────────────────────────────────┘
```
New fields in the JSON index per tensor:
| Field | Type | Description |
|---|---|---|
| `dtype` | string | `"int4"` or `"int8"` |
| `scale_count` | int | Number of FP32 scale values (one per group) |
| `group_size` | int | Elements per quant group (0 = per-tensor) |
| `zp_count` | int | Number of zero-point bytes (asymmetric only) |
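Given the layout above, the JSON index can be located by first reading the trailing 4-byte little-endian length word. A minimal C++20 sketch (function names are illustrative, not part of tensorbit-core; error handling omitted):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Decode a 4-byte little-endian uint32 regardless of host endianness.
uint32_t read_le_u32(const unsigned char* p) {
    return static_cast<uint32_t>(p[0])
         | static_cast<uint32_t>(p[1]) << 8
         | static_cast<uint32_t>(p[2]) << 16
         | static_cast<uint32_t>(p[3]) << 24;
}

// Return the raw JSON index text from the tail of a .tbm container:
// the last 4 bytes hold the JSON byte length, the JSON sits just before it.
std::string read_json_index(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    const std::streamoff size = f.tellg();
    unsigned char len_buf[4];
    f.seekg(size - 4);                       // trailer: JSON byte length
    f.read(reinterpret_cast<char*>(len_buf), 4);
    const uint32_t json_len = read_le_u32(len_buf);
    std::string json(json_len, '\0');
    f.seekg(size - 4 - static_cast<std::streamoff>(json_len));
    f.read(json.data(), json_len);
    return json;
}
```

Reading the length byte-by-byte (rather than `memcpy` into a `uint32_t`) keeps the decode correct on big-endian hosts as well.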
```sh
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --target tb-quant --parallel 4
```
```sh
# INT4 symmetric, 128-element groups (default)
./bin/tb-quant --model model.tbm --output model.int4.tbm

# INT8 per-channel symmetric
./bin/tb-quant --model model.tbm --output model.int8.tbm --dtype int8 --group-size 0

# INT4 asymmetric
./bin/tb-quant --model model.tbm --output model.int4.tbm --scheme asymmetric
```

| Flag | Default | Description |
|---|---|---|
| `--model PATH` | (required) | Input FP32 .tbm file |
| `--output PATH` | (required) | Output quantized .tbm file |
| `--dtype TYPE` | `int4` | Quantization type: `int4` or `int8` |
| `--scheme SCHEME` | `symmetric` | `symmetric` or `asymmetric` |
| `--group-size N` | `128` | Elements per quant group (0 = per-tensor) |
| `--help`, `-h` | | Print help |
| `--version` | | Print version |
Symmetric INT4 (default):
- Group of 128 weights → find `max_abs = max(|w|)`
- Scale = `max_abs / 7`
- Each weight: `q = round(w / scale)`, clamped to `[-8, 7]`
- Two 4-bit values packed per byte (low nibble first)
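The symmetric INT4 steps can be sketched as below; `quantize_group_int4` and `dequant_int4` are illustrative names, not the tool's API (assumes a non-empty group):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedGroup {
    float scale;                   // max_abs / 7
    std::vector<uint8_t> packed;   // two 4-bit values per byte
};

QuantizedGroup quantize_group_int4(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    const float scale = max_abs > 0.0f ? max_abs / 7.0f : 1.0f;

    QuantizedGroup out{scale, {}};
    out.packed.resize((w.size() + 1) / 2, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        int q = static_cast<int>(std::lround(w[i] / scale));
        q = std::clamp(q, -8, 7);
        const uint8_t nibble = static_cast<uint8_t>(q & 0xF); // two's complement in 4 bits
        if (i % 2 == 0) out.packed[i / 2] |= nibble;          // low nibble first
        else            out.packed[i / 2] |= nibble << 4;
    }
    return out;
}

// Inverse, for verification: unpack one value, sign-extend, rescale.
float dequant_int4(const QuantizedGroup& g, size_t i) {
    const uint8_t byte = g.packed[i / 2];
    int q = (i % 2 == 0) ? (byte & 0xF) : (byte >> 4);
    if (q >= 8) q -= 16;            // sign-extend the 4-bit value
    return static_cast<float>(q) * g.scale;
}
```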
Symmetric INT8 per-row:
- One scale per matrix row. Scale = `max_abs / 127`
- Each weight stored as a signed 8-bit integer
Asymmetric INT4:
- Per-group min/max. Scale = `(max - min) / 15`, zero-point = `round(-min / scale)`
- Each weight: `q = round(w / scale) + zp`, clamped to `[0, 15]`
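The asymmetric scheme maps each group's [min, max] onto the unsigned codes 0–15. A sketch under the same illustrative-naming caveat (codes left unpacked for clarity; assumes a non-empty group):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct AsymGroup {
    float scale;                  // (max - min) / 15
    int zero_point;               // round(-min / scale)
    std::vector<uint8_t> codes;   // one 4-bit code per weight, unpacked
};

AsymGroup quantize_group_asym4(const std::vector<float>& w) {
    auto [lo, hi] = std::minmax_element(w.begin(), w.end());
    const float mn = *lo, mx = *hi;
    const float scale = (mx > mn) ? (mx - mn) / 15.0f : 1.0f;
    const int zp = static_cast<int>(std::lround(-mn / scale));

    AsymGroup out{scale, zp, {}};
    out.codes.reserve(w.size());
    for (float v : w) {
        int q = static_cast<int>(std::lround(v / scale)) + zp;
        out.codes.push_back(static_cast<uint8_t>(std::clamp(q, 0, 15)));
    }
    return out;
}

// Recover the approximate FP32 weight from a stored code.
float dequant_asym4(const AsymGroup& g, size_t i) {
    return static_cast<float>(static_cast<int>(g.codes[i]) - g.zero_point) * g.scale;
}
```

Unlike the symmetric scheme, the zero-point lets a skewed group (e.g. all-positive weights) use all 16 codes instead of wasting the negative half of the range.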
This project is dual-licensed.

- Open source use: Licensed under the GNU AGPLv3. You may use, modify, and distribute the code under the terms of the AGPL, which requires all modifications and larger works to be licensed under the same license and requires making source code available to network users.
- Commercial use: If you wish to use this library in a proprietary product without the copyleft obligations of the AGPL, a separate commercial license is available. Please contact us for details.