release(0.32.1): streaming detokenization keeps word-boundary spaces #200

Merged
michalharakal merged 1 commit into develop from release/0.32.1 on Jun 26, 2026
@michalharakal

Per-token streaming decode no longer runs words together ("the process", not "theprocess"). SentencePieceSpecialTokens.decode(Int) and UpstreamTokenizerAdapter.decode(Int) route through engine 0.32.4's Tokenizer.decodeToken(id), which preserves each SentencePiece piece's leading space instead of applying the sequence-level addSpacePrefix strip per token. Adds SentencePieceSpecialTokensStreamingTest.

  • Engine pin: skainet 0.32.2 -> 0.32.4 (adds Tokenizer.decodeToken)
  • Version: 0.32.0 -> 0.32.1
  • README / CHANGELOG / antora docs bumped to 0.32.1 (engine 0.32.4)
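
The difference between the two decode paths can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the skainet implementation: the four-piece vocabulary and the functions `decodeSequence`, `decodeTokenStripping`, and `decodeTokenPreserving` are hypothetical names invented for the example. It assumes only the usual SentencePiece convention that `▁` (U+2581) marks a word boundary.

```kotlin
// Toy illustration only: every name here is hypothetical, not the skainet API.
const val SPACE_MARKER = '\u2581' // SentencePiece's word-boundary marker

// Hypothetical four-piece vocabulary; the list index is the token id.
val pieces = listOf("\u2581the", "\u2581process", "\u2581is", "ing")

// Sequence-level decode: join the pieces, turn markers into spaces, then strip
// the single leading space that the space-prefix convention adds to the text.
fun decodeSequence(ids: List<Int>): String =
    ids.joinToString("") { pieces[it] }
        .replace(SPACE_MARKER, ' ')
        .removePrefix(" ")

// Old behavior (modeled): applying the sequence-level strip to each token
// removes the leading space of every piece, so word boundaries vanish.
fun decodeTokenStripping(id: Int): String = decodeSequence(listOf(id))

// New behavior (modeled): per-token decode keeps the piece's leading space.
fun decodeTokenPreserving(id: Int): String = pieces[id].replace(SPACE_MARKER, ' ')

fun main() {
    val ids = listOf(0, 1)
    println(ids.joinToString("") { decodeTokenStripping(it) })  // theprocess
    println(ids.joinToString("") { decodeTokenPreserving(it) }) // " the process"
}
```

In this sketch the preserved output begins with a space; a streaming consumer would typically trim that once at the start of the stream rather than on every token.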

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 393b75a into develop Jun 26, 2026
6 checks passed
@michalharakal michalharakal deleted the release/0.32.1 branch June 26, 2026 19:22