fix: add OOV character handling to C++ runtime for Chinese TN #369

Closed
pengzhendong wants to merge 1 commit into master from fix/cpp-runtime-oov-handling



@pengzhendong (Member) commented Jun 15, 2026

Summary

  • Adds Unicode-range-based OOV (out-of-vocabulary) character detection to the C++ runtime Processor::Normalize method
  • When the FST is built with the default tag_oov=False, characters outside the CJK Unified Ideographs, ASCII, and common punctuation ranges are now wrapped in <oov> tags by the runtime
  • The check is skipped if the output already contains <oov> tags (when the FST was built with OOV support), avoiding double-wrapping
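
For readers without the diff at hand, the post-processing step described above could look roughly like the sketch below. It is not the code from this PR: the function name `TagOov`, the exact range table, the per-character wrapping, and the `</oov>` closing tag are all assumptions made for illustration. It only shows the general idea of decoding UTF-8, checking each code point against an allow-list of ranges, and skipping the pass when `<oov>` is already present.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper type: one inclusive range of in-vocabulary code points.
struct CodePointRange {
  char32_t first;
  char32_t last;
};

// Assumed allow-list; the ranges used by the actual PR may differ.
static const CodePointRange kInVocabRanges[] = {
    {0x0000, 0x007F},  // ASCII
    {0x2000, 0x206F},  // General Punctuation
    {0x3000, 0x303F},  // CJK Symbols and Punctuation
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs
    {0xFF01, 0xFF5E},  // Fullwidth ASCII variants
};

// Length in bytes of a UTF-8 sequence, judged by its lead byte.
// Invalid lead bytes are treated as one-byte sequences.
inline std::size_t Utf8CharLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead >> 5) == 0x06) return 2;
  if ((lead >> 4) == 0x0E) return 3;
  if ((lead >> 3) == 0x1E) return 4;
  return 1;
}

// Decodes the sequence of `len` bytes starting at `pos`.
// Malformed input is not rejected; it simply yields a code point
// that is unlikely to fall in the allow-list.
inline char32_t DecodeUtf8(const std::string& s, std::size_t pos,
                           std::size_t len) {
  unsigned char lead = static_cast<unsigned char>(s[pos]);
  if (len == 1) return lead;
  char32_t cp = lead & (0xFF >> (len + 1));
  for (std::size_t i = 1; i < len; ++i) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
  }
  return cp;
}

inline bool IsInVocab(char32_t cp) {
  for (const CodePointRange& r : kInVocabRanges) {
    if (cp >= r.first && cp <= r.last) return true;
  }
  return false;
}

// Wraps every out-of-range character in <oov>...</oov>.
// Returns the input unchanged if it already contains an <oov> tag,
// which is taken to mean the FST was built with OOV support.
inline std::string TagOov(const std::string& text) {
  if (text.find("<oov>") != std::string::npos) return text;
  std::string out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    std::size_t len =
        std::min(Utf8CharLength(static_cast<unsigned char>(text[pos])),
                 text.size() - pos);
    if (IsInVocab(DecodeUtf8(text, pos, len))) {
      out.append(text, pos, len);
    } else {
      out += "<oov>";
      out.append(text, pos, len);
      out += "</oov>";
    }
    pos += len;
  }
  return out;
}
```

With this table, ASCII and CJK ideographs pass through unchanged, while a Hangul syllable such as U+D55C is wrapped. Checking for an existing `<oov>` substring is a cheap guard, but it also means any input that literally contains `<oov>` bypasses the pass, which may or may not be acceptable.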

Root Cause

The C++ runtime test loads test data from tn/chinese/test/data/normalizer.txt, which includes OOV test cases (Korean and Japanese characters). These test cases expect <oov> wrapping, but the pre-built .fst files are built with tag_oov=False by default, so the OOV characters pass through unchanged.

The Python tests pass because they use Normalizer(overwrite_cache=True, tag_oov=True) which rebuilds the FSTs with OOV support before testing.

Test Plan

  • C++ processor_test: 48/48 passed (both with and without OOV support in FSTs)
  • C++ string_test: 4/4 passed
  • C++ token_parser_test: 7/7 passed
  • Python full test suite: 1890/1890 passed

Fixes #368

The C++ runtime previously relied on the pre-built .fst files to handle
OOV (out-of-vocabulary) characters like Korean Hangul and Japanese Kana.
When the FSTs were built without tag_oov=True (the default), these
characters passed through unchanged instead of being wrapped in <oov> tags.

This adds Unicode-range-based OOV detection as a post-processing step in
Processor::Normalize. Characters outside the CJK Unified Ideographs,
ASCII, and common punctuation ranges are wrapped in <oov> tags. The check
is skipped if the output already contains <oov> tags (i.e., the FST was
built with OOV support), avoiding double-wrapping.

Fixes #368

@pengzhendong force-pushed the fix/cpp-runtime-oov-handling branch from 451f5e0 to b006900 on June 15, 2026 06:45
@pengzhendong deleted the fix/cpp-runtime-oov-handling branch on June 15, 2026 06:46

Development

Successfully merging this pull request may close these issues.

Runtime unit-test(processor_test) failed