Tensorbit Quant

Third stage of the Tensorbit Labs P-D-Q pipeline (Prune → Distill → Quant → Run).

Reads an FP32 .tbm model container from tensorbit-core, quantizes weight tensors to INT4 or INT8, and writes a new .tbm with compressed weights + per-group scale metadata. Zero external dependencies — C++20 standard library only.

.tbm Container Format

┌─────────────────────────────────────────────────────────────────┐
│  Tensor 0 Blob                                                  │
│  ┌──────────┬───────────────────┬──────────┬───────┐            │
│  │ TBHeader │ Quantized weights │  Scales  │ Masks │            │
│  │ (4096 B) │ (INT4: N/2 bytes) │ (FP32)   │       │            │
│  └──────────┴───────────────────┴──────────┴───────┘            │
├─────────────────────────────────────────────────────────────────┤
│  ... more tensors ...                                           │
├─────────────────────────────────────────────────────────────────┤
│  JSON Index (UTF-8)                                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ {"name":"...","offset":...,"dtype":"int4",                │   │
│  │  "num_weights":...,"num_mask_bytes":...,                  │   │
│  │  "scale_count":8192,"group_size":128}                     │   │
│  └──────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│  4-byte LE uint32 = JSON byte length                            │
└─────────────────────────────────────────────────────────────────┘

New fields in the JSON index per tensor:

| Field | Type | Description |
| --- | --- | --- |
| `dtype` | string | `"int4"` or `"int8"` |
| `scale_count` | int | Number of FP32 scale values (one per group) |
| `group_size` | int | Elements per quant group (`0` = per-tensor) |
| `zp_count` | int | Number of zero-point bytes (asymmetric only) |
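The trailing length field makes the index easy to locate without scanning the tensor blobs. A minimal sketch of recovering the JSON index from a container already loaded into memory, assuming only the layout shown in the diagram above (`read_tbm_index` is an illustrative name, not part of the repo's API):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Recover the JSON index from an in-memory .tbm container. Per the
// layout above, the last 4 bytes are a little-endian uint32 giving
// the byte length of the UTF-8 JSON index immediately preceding them.
std::string read_tbm_index(const std::vector<uint8_t>& file) {
    if (file.size() < 4)
        throw std::runtime_error("truncated .tbm container");
    // Assemble the length byte-by-byte so the code is also correct
    // on big-endian hosts.
    const uint8_t* p = file.data() + file.size() - 4;
    uint32_t json_len = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                        (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    if ((size_t)json_len + 4 > file.size())
        throw std::runtime_error("index length exceeds file size");
    const char* start =
        reinterpret_cast<const char*>(file.data() + file.size() - 4 - json_len);
    return std::string(start, json_len);
}
```

Reading back to front this way means a loader can seek to the index first and then map only the tensor blobs it needs.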

Usage

mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --target tb-quant --parallel 4

# INT4 symmetric, 128-element groups (default)
./bin/tb-quant --model model.tbm --output model.int4.tbm

# INT8 per-channel symmetric
./bin/tb-quant --model model.tbm --output model.int8.tbm --dtype int8 --group-size 0

# INT4 asymmetric
./bin/tb-quant --model model.tbm --output model.int4.tbm --scheme asymmetric

Options

| Flag | Default | Description |
| --- | --- | --- |
| `--model PATH` | (required) | Input FP32 `.tbm` file |
| `--output PATH` | (required) | Output quantized `.tbm` file |
| `--dtype TYPE` | `int4` | Quantization type: `int4` or `int8` |
| `--scheme SCHEME` | `symmetric` | `symmetric` or `asymmetric` |
| `--group-size N` | `128` | Elements per quant group (`0` = per-tensor) |
| `--help`, `-h` | | Print help |
| `--version` | | Print version |

Quantization Methods

Symmetric INT4 (default):

  • Group of 128 weights → find max_abs = max(|w|)
  • Scale = max_abs / 7. Each weight: q = round(w / scale), clamped to [-8, 7]
  • Two 4-bit values packed per byte (low nibble first)
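The steps above can be sketched as follows (a standalone illustration, not the repo's actual code; `Int4Group` and `quantize_group_int4` are hypothetical names):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Int4Group {
    float scale;                 // one FP32 scale for the whole group
    std::vector<uint8_t> packed; // ceil(n / 2) bytes, low nibble first
};

Int4Group quantize_group_int4(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    // Guard against an all-zero group; how the real tool handles this
    // case is an assumption here.
    float scale = max_abs > 0.0f ? max_abs / 7.0f : 1.0f;
    Int4Group g{scale, std::vector<uint8_t>((w.size() + 1) / 2, 0)};
    for (size_t i = 0; i < w.size(); ++i) {
        // q = round(w / scale), clamped to [-8, 7]
        int q = std::clamp((int)std::lround(w[i] / scale), -8, 7);
        uint8_t nib = (uint8_t)(q & 0xF); // two's-complement nibble
        g.packed[i / 2] |= (i % 2 == 0) ? nib : (uint8_t)(nib << 4);
    }
    return g;
}
```

Note the clamp range is asymmetric: scaling by `max_abs / 7` means the largest-magnitude weight maps to ±7, so the extra code point −8 is only reached through rounding at the boundary.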

Symmetric INT8 per-row:

  • One scale per matrix row. Scale = max_abs / 127
  • Each weight stored as signed 8-bit integer
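A corresponding per-row sketch (hypothetical helper; the clamp to [-127, 127] is a common symmetric-range convention and an assumption here, not confirmed from the repo):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize one matrix row to INT8 with a single scale = max_abs / 127.
std::vector<int8_t> quantize_row_int8(const std::vector<float>& row,
                                      float& scale_out) {
    float max_abs = 0.0f;
    for (float x : row) max_abs = std::max(max_abs, std::fabs(x));
    scale_out = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    std::vector<int8_t> q(row.size());
    for (size_t i = 0; i < row.size(); ++i)
        q[i] = (int8_t)std::clamp((int)std::lround(row[i] / scale_out),
                                  -127, 127);
    return q;
}
```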

Asymmetric INT4:

  • Per-group min/max. Scale = (max - min) / 15, zero-point = round(-min / scale)
  • Each weight: q = round(w / scale) + zp, clamped to [0, 15]
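The asymmetric scheme is an affine map of the group's `[min, max]` range onto `[0, 15]`. A hedged sketch (names are illustrative; clamping the zero-point into `[0, 15]` is an assumption, not confirmed from the repo):

```cpp
#include <algorithm>
#include <cmath>

struct AsymParams {
    float scale;
    int zp; // zero-point
};

// scale = (max - min) / 15, zero-point = round(-min / scale)
AsymParams asym_params(float mn, float mx) {
    float scale = (mx - mn) / 15.0f;
    if (scale == 0.0f) scale = 1.0f; // degenerate constant group
    int zp = (int)std::lround(-mn / scale);
    return {scale, std::clamp(zp, 0, 15)};
}

// q = round(w / scale) + zp, clamped to [0, 15]
int quantize_asym(float w, const AsymParams& p) {
    return std::clamp((int)std::lround(w / p.scale) + p.zp, 0, 15);
}

// Dequantization recovers w ~= (q - zp) * scale.
float dequantize_asym(int q, const AsymParams& p) {
    return (float)(q - p.zp) * p.scale;
}
```

Unlike the symmetric scheme, zero is represented exactly at `q = zp`, which matters for sparse or pruned weight tensors.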

License

This project is dual-licensed.

  • Open source use: Licensed under the GNU AGPLv3. You may use, modify, and distribute the code under the terms of the AGPL, which requires all modifications and larger works to be licensed under the same license and requires making source code available to network users.

  • Commercial use: If you wish to use this library in a proprietary product without the copyleft obligations of the AGPL, a separate commercial license is available. Please contact us for details.
