fix: add OOV character handling to C++ runtime for Chinese TN #369
Closed
pengzhendong wants to merge 1 commit into
The C++ runtime previously relied on the pre-built .fst files to handle OOV (out-of-vocabulary) characters such as Korean Hangul and Japanese Kana. The FSTs are built with tag_oov=False by default, so with default-built FSTs these characters passed through unchanged instead of being wrapped in <oov> tags.

This PR adds Unicode-range-based OOV detection as a post-processing step in Processor::Normalize. Characters outside the CJK Unified Ideographs, ASCII, and common punctuation ranges are wrapped in <oov> tags. The check is skipped if the output already contains <oov> tags (i.e., the FST was built with OOV support), which avoids double-wrapping.

Fixes #368
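The post-processing step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names (`IsInVocab`, `DecodeUtf8`, `TagOov`), the exact code point ranges, and the choice to wrap each OOV character individually are all assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the code point is treated as in-vocabulary.
// The ranges below are illustrative assumptions, not the PR's actual table.
bool IsInVocab(char32_t cp) {
  if (cp <= 0x7F) return true;                    // ASCII
  if (cp >= 0x4E00 && cp <= 0x9FFF) return true;  // CJK Unified Ideographs
  if (cp >= 0x2000 && cp <= 0x206F) return true;  // General Punctuation
  if (cp >= 0x3000 && cp <= 0x303F) return true;  // CJK Symbols and Punctuation
  if (cp >= 0xFF01 && cp <= 0xFF5E) return true;  // Fullwidth ASCII variants
  return false;
}

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and return
// its byte length. Malformed input is consumed one byte at a time and
// reported as U+FFFD, which IsInVocab treats as OOV.
size_t DecodeUtf8(const std::string& s, size_t i, char32_t* cp) {
  unsigned char b0 = static_cast<unsigned char>(s[i]);
  size_t len = 0;
  char32_t value = 0;
  if (b0 < 0x80) {
    *cp = b0;
    return 1;
  } else if ((b0 & 0xE0) == 0xC0) {
    len = 2;
    value = b0 & 0x1F;
  } else if ((b0 & 0xF0) == 0xE0) {
    len = 3;
    value = b0 & 0x0F;
  } else if ((b0 & 0xF8) == 0xF0) {
    len = 4;
    value = b0 & 0x07;
  } else {
    *cp = 0xFFFD;
    return 1;
  }
  if (i + len > s.size()) {
    *cp = 0xFFFD;
    return 1;
  }
  for (size_t k = 1; k < len; ++k) {
    unsigned char b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) {
      *cp = 0xFFFD;
      return 1;
    }
    value = (value << 6) | (b & 0x3F);
  }
  *cp = value;
  return len;
}

// Hypothetical post-processing step: wrap each OOV character in <oov> tags.
std::string TagOov(const std::string& text) {
  // If the FST already emitted <oov> tags, return the text untouched so that
  // nothing gets wrapped twice.
  if (text.find("<oov>") != std::string::npos) return text;
  std::string out;
  for (size_t i = 0; i < text.size();) {
    char32_t cp = 0;
    size_t len = DecodeUtf8(text, i, &cp);
    if (IsInVocab(cp)) {
      out.append(text, i, len);
    } else {
      out += "<oov>";
      out.append(text, i, len);
      out += "</oov>";
    }
    i += len;
  }
  return out;
}
```

In this sketch, a call such as `TagOov(output)` would run on the normalized string just before it is returned. The early return on an existing `<oov>` substring is a coarse check: it assumes that any tagged output came from an FST built with OOV support and is therefore already fully tagged.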
451f5e0 to b006900
Summary
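- Add Unicode-range-based OOV detection as a post-processing step in the Processor::Normalize method
- When the FSTs are built without tag_oov=True (the default), characters outside the CJK Unified Ideographs, ASCII, and common punctuation ranges are now correctly wrapped in <oov> tags
- Skip the check if the output already contains <oov> tags (when the FST was built with OOV support), avoiding double-wrapping

Root Cause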
The C++ runtime test loads test data from tn/chinese/test/data/normalizer.txt, which includes OOV test cases (Korean and Japanese characters). These test cases expect <oov> wrapping, but the pre-built .fst files are built with tag_oov=False by default, so the OOV characters pass through unchanged.

The Python tests pass because they use Normalizer(overwrite_cache=True, tag_oov=True), which rebuilds the FSTs with OOV support before testing.

Test Plan
Fixes #368