release: 0.32.4 — streaming detokenization preserves word spaces by michalharakal · Pull Request #767 · SKaiNET-developers/SKaiNET

michalharakal · 2026-06-26T18:42:45Z

Merges the 0.32.4 release back into develop. The tag 0.32.4 is published to Maven Central (deployment succeeded); this brings the release commit onto develop so the version (0.32.4) and CHANGELOG/docs reflect the release.

What 0.32.4 ships

Fix: streaming detokenization preserves word-boundary spaces. A generation loop that decodes one token at a time (decode(tokenId)) ran words together ("the process" → "theprocess") because the single-token path delegated to the sequence-level SentencePieceTokenizer.decode(IntArray). That method strips the leading space introduced by addSpacePrefix, which is correct once per sequence but not once per token.

Tokenizer.decodeToken(id) — new interface method, default = decode(intArrayOf(id)) (backward-compatible); gives the upstream tokenizer the streaming single-token decode it lacked.
SentencePieceTokenizer overrides decodeToken to decode without the leading strip (llama.cpp token_to_piece semantics); adds decode(ids, stripLeadingSpace). decode(IntArray) behaviour unchanged.
Tests: streaming decode keeps spaces; batch still strips once; regression guard for the old "Helloworld" behaviour.

Fixes correct-but-spaceless output in every streaming consumer (kllama, agent loops, any decode(Int) caller).
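
To make the difference concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the SKaiNET implementation: ToySentencePieceTokenizer, its two-piece vocabulary, and the unconditional leading-space strip are assumptions made for illustration. Only the method names decode(IntArray), decodeToken(id), and decode(ids, stripLeadingSpace) come from this PR.

```kotlin
// Illustrative stand-in for the interface described in this PR.
interface Tokenizer {
    fun decode(ids: IntArray): String

    // The default delegates to the sequence-level decode, so existing
    // implementations keep compiling and behave as before.
    fun decodeToken(id: Int): String = decode(intArrayOf(id))
}

// Hypothetical toy tokenizer: each piece uses U+2581 as the SentencePiece
// word-boundary marker. The real class and its vocabulary handling differ.
class ToySentencePieceTokenizer(private val pieces: List<String>) : Tokenizer {

    fun decode(ids: IntArray, stripLeadingSpace: Boolean): String {
        val text = ids.joinToString("") { pieces[it] }.replace('\u2581', ' ')
        // Assumption: the strip is unconditional here; the real tokenizer
        // ties it to addSpacePrefix.
        return if (stripLeadingSpace) text.removePrefix(" ") else text
    }

    // Sequence-level decode: strip the leading space once per sequence.
    override fun decode(ids: IntArray): String = decode(ids, stripLeadingSpace = true)

    // Per-token decode: keep the piece's leading word-boundary space.
    override fun decodeToken(id: Int): String = decode(intArrayOf(id), stripLeadingSpace = false)
}

fun main() {
    val tok = ToySentencePieceTokenizer(listOf("\u2581the", "\u2581process"))
    val ids = intArrayOf(0, 1)

    // Old streaming path: every token goes through the sequence-level decode.
    val before = ids.joinToString("") { tok.decode(intArrayOf(it)) }
    // New streaming path: per-token decode keeps the spaces.
    val after = ids.joinToString("") { tok.decodeToken(it) }

    println("[$before]")            // [theprocess]
    println("[$after]")             // [ the process]
    println("[${tok.decode(ids)}]") // [the process]
}
```

In this sketch the streamed text starts with a space, because no per-sequence strip is applied on the per-token path. Whether a consumer trims that first space is left to the caller here; the PR text does not say how the real consumers handle it.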

Notes

skainet-io-core is not API-tracked, so no .api dump change is needed.
antora.yml is version: ~ (branch-tracked).

Tokenizer.decodeToken(id): per-token streaming decode that keeps each SentencePiece piece's leading word-boundary space (llama.cpp token_to_piece semantics), so a generation loop decoding one token at a time no longer runs words together ("the process" -> "theprocess"). SentencePieceTokenizer overrides it to skip the sequence-level addSpacePrefix strip; adds decode(ids, stripLeadingSpace). Backward-compatible (decode(IntArray) unchanged).

Version bump + CHANGELOG/README/docs version snippets -> 0.32.4. antora.yml is version: ~ (branch-tracked). skainet-io-core is not API-tracked, no dump change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-26T18:46:00Z

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

Operator documentation: docs/modules/operators/_generated_/
JSON schema output: operators.json

Artifacts:

Download the documentation-preview-767 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

michalharakal merged commit 740dcc4 into develop Jun 26, 2026
10 checks passed

michalharakal deleted the release/0.32.4 branch June 26, 2026 18:43
