fix: preserve spaces as token boundaries in English TN tagger#365

Merged
pengzhendong merged 1 commit into master from fix/en-tn-space-boundary on Jun 11, 2026

@pengzhendong pengzhendong commented Jun 11, 2026

Member

Summary

  • Refactor English TN normalizer tagger to preserve input spaces as token boundaries instead of deleting all spaces before tagging
  • Remove single-letter f from unit_alternatives.tsv (Fahrenheit should use °F, not bare f)
  • Lower range tagger weight to 1.0 so "4x" is matched as range ("four times") rather than serial ("four x")
  • Lower fraction tagger weight to 0.99 to preserve "3/4" as "three quarters"
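The effect of the range weight change can be pictured with a toy selection step. This is only a sketch, assuming the candidate reading with the lowest total weight wins (as in a shortest-path search over a weighted grammar). The serial weight below is hypothetical and was picked only to be larger than 1.0; `best_reading` is an illustrative helper, not part of WeText. The fraction weight (0.99) is not modeled here.

```python
# Toy model of weighted tagger selection; not the WeText grammar itself.
# Assumption: the candidate with the lowest total weight wins.
RANGE_WEIGHT = 1.0   # value stated in this PR
SERIAL_WEIGHT = 1.1  # hypothetical, chosen only to be larger than 1.0


def best_reading(candidates):
    """Return the verbalization of the lowest-weight candidate."""
    _name, _weight, spoken = min(candidates, key=lambda c: c[1])
    return spoken


# Two competing readings of the input "4x".
candidates_4x = [
    ("serial", SERIAL_WEIGHT, "four x"),
    ("range", RANGE_WEIGHT, "four times"),
]

print(best_reading(candidates_4x))
```

With the range reading weighted below the serial one, "4x" resolves to "four times".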

Background

Issue wenet-e2e/wetext#15 reported "4x faster" → "four times degrees Fahrenheit aster", a major normalization failure. The root cause was that the tagger deleted all input spaces (delete(" ").star + tagger.star) before matching, which destroyed token boundaries. With the spaces gone, the input became "4xfaster", and its leading "4xf" was consumed as a single measure token: 4x → cardinal+range ("four times") and f → unit "degrees Fahrenheit", leaving the stray "aster".

The fix follows NeMo's approach: preserve spaces as boundaries between classified tokens, so each rule can only consume up to a space boundary and cannot cross into adjacent words.
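The difference between the two strategies can be illustrated in plain Python. This is a minimal sketch with hypothetical toy rules (`tag_measure`, `ONES`, `UNITS` are invented for illustration), not the actual pynini grammar. It assumes a greedy measure rule of the form digit + "x" + optional unit, and it keeps the bare "f" unit alias that this PR removes so the original failure is reproducible.

```python
import re

# Hypothetical toy tables, NOT the real WeText data files.
ONES = {"3": "three", "4": "four"}
UNITS = {"f": "degrees Fahrenheit"}  # bare "f" alias, kept here to show the bug


def tag_measure(text):
    """Toy measure rule: <digit>x followed by an optional one-letter unit.

    Returns (spoken, unconsumed_rest), or None if the rule does not apply.
    """
    m = re.match(r"(\d)x", text)
    if not m:
        return None
    spoken = [ONES.get(m.group(1), m.group(1)), "times"]
    rest = text[m.end():]
    if rest[:1] in UNITS:  # greedily swallow a following unit letter
        spoken.append(UNITS[rest[0]])
        rest = rest[1:]
    return " ".join(spoken), rest


def normalize_without_spaces(text):
    """Old behaviour: delete every space, then tag the joined string."""
    joined = text.replace(" ", "")
    result = tag_measure(joined)
    if result is None:
        return joined
    spoken, rest = result
    return (spoken + " " + rest).strip()


def normalize_with_boundaries(text):
    """New behaviour: a rule must consume exactly one space-delimited token."""
    out = []
    for token in text.split(" "):
        result = tag_measure(token)
        if result is not None and result[1] == "":
            out.append(result[0])
        else:
            out.append(token)  # leave unmatched words untouched
    return " ".join(out)


print(normalize_without_spaces("4x faster"))
print(normalize_with_boundaries("4x faster"))
```

In the first function the rule reaches across the deleted space and eats the "f" of "faster"; in the second, the rule sees only "4x" and "faster" as separate tokens, so the word survives intact even with the bare "f" alias still present.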

Test plan

  • python3 -m tn --language en --text "4x faster" → "four times faster" (not "four times degrees Fahrenheit aster")
  • python3 -m tn --language en --text "4x" → "four times"
  • python3 -m tn --language en --text "3/4" → "three quarters"
  • python3 -m tn --language en --text "hello, world" → "hello, world"
  • python3 -m tn --language en --text "4°F" → "four degrees Fahrenheit"
  • pytest tn/english/test/ — 114 passed
  • pytest itn/english/test/ — 487 passed (no regression)

The English TN normalizer previously deleted all input spaces before
tagging (delete(" ").star + tagger.star), which destroyed token
boundaries and caused "4x faster" to be misparsed as a single measure
token producing "four times degrees Fahrenheit aster" (wenet-e2e/wetext#15).

Refactor the tagger composition to preserve spaces as token boundaries
(NeMo-style), using closure(punct) + classify + closure(punct) as
token units with delete(SPACE) | punct as inter-token separators.
Also remove single-letter f from unit_alternatives.tsv since 4°F
is the correct Fahrenheit input format, not 4f. Lower range tagger
weight to 1.0 so "4x" is matched as range ("four times") rather than
serial ("four x"), and fraction to 0.99 to preserve "3/4" as "three
quarters".
@pengzhendong pengzhendong force-pushed the fix/en-tn-space-boundary branch from 8be2f91 to 6192b47 Compare June 11, 2026 06:25
@pengzhendong pengzhendong merged commit e5a76f6 into master Jun 11, 2026
1 check passed
@pengzhendong pengzhendong deleted the fix/en-tn-space-boundary branch June 11, 2026 06:27