OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) #1110

Open
krickert wants to merge 3 commits into OPENNLP-1850-1b-alignment from OPENNLP-1850-2a-tokenizer

@krickert

Contributor

Part 2a of the OPENNLP-1850 stack. Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), as requested in review.

This PR is self-contained. It includes the Unicode-conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data; the official WordBreakTest.txt conformance suite (1944/1944 passing); and the Unicode data LICENSE/NOTICE/rat-excludes entries.

WordBreakProperty and ExtendedPictographic load their data lazily and recoverably (double-checked accessor, no classpath resource I/O in a static {} block), per the same review point as on the foundation — so a resource the loader cannot see is a catchable exception at call time, not a class-poisoning ExceptionInInitializerError.
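
For readers unfamiliar with the pattern, here is a minimal sketch of a lazy, double-checked accessor. It is not the PR's code: the class name `LazyTable`, the method `rows()`, and the choice of `IllegalStateException` for a missing resource are all hypothetical and chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical lazily loaded table; illustrates the pattern only. */
final class LazyTable {

  private final String resource;

  // volatile is what makes double-checked locking safe under the Java memory model.
  private volatile List<String> rows;

  LazyTable(String resource) {
    this.resource = resource;
  }

  /** Loads on first use. A failed load throws here and leaves the object usable for a retry. */
  List<String> rows() {
    List<String> local = rows;          // first check, no lock
    if (local == null) {
      synchronized (this) {
        local = rows;                   // second check, under the lock
        if (local == null) {
          local = load();               // may throw; rows stays null in that case
          rows = local;
        }
      }
    }
    return local;
  }

  private List<String> load() {
    InputStream in = LazyTable.class.getResourceAsStream(resource);
    if (in == null) {
      // Assumed exception type; the point is that it is an ordinary, catchable exception.
      throw new IllegalStateException("Missing classpath resource: " + resource);
    }
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      List<String> out = new ArrayList<>();
      String line;
      while ((line = reader.readLine()) != null) {
        out.add(line);
      }
      return Collections.unmodifiableList(out);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed reading " + resource, e);
    }
  }
}
```

Because nothing runs in a static initializer, a caller can catch the failure from `rows()` and the class itself remains initialized and usable.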

Base: OPENNLP-1850-1b-alignment (#1109). Stack: 1a → 1b → 2a (this) → 2b → 2c → DL → docs.

@krickert

Contributor Author

@rzo1 Both points on the tokenizer PR are addressed.

Resource loading (done). WordBreakProperty and ExtendedPictographic no longer load in a static {} block — each now loads lazily on first use via a double-checked accessor, so a resource the loader can't see is a catchable exception at call time rather than an ExceptionInInitializerError that poisons the class (and would otherwise take the whole WordSegmenter/WordTokenizer down). The getResourceAsStream in WordBoundaryConformanceTest is left as-is (test-only).

Split into 2a / 2b / 2c (done). Along the three concepts you identified:

#1105 (DL) is now based on #1112, and I closed #1104 with a pointer to this PR. Each layer builds and tests green on its own.

@krickert krickert marked this pull request as ready for review June 25, 2026 11:28
@rzo1

rzo1 commented Jun 25, 2026

Contributor
  • Loader symmetry: WordBreakProperty.parse fails loud on a missing ;, while ExtendedPictographic.parse silently treats a line with no ; as whole-content, and only the former has a malformed-data/missing-resource test. Worth making the two read and test alike.
  • WordType.of fixes a token's script from its first script code point only; fine for single-script UAX#29 segments but a heuristic for mixed-script runs. A one-line javadoc note would help.

@krickert krickert force-pushed the OPENNLP-1850-1b-alignment branch from b24c9ee to 702acc5 Compare June 25, 2026 17:24
@krickert krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from 3fae8aa to f2d1d8c Compare June 25, 2026 17:25
@krickert

Contributor Author

@rzo1 Both addressed (tip f2d1d8cc).

Loader symmetry. I kept the two loaders deliberately different but documented why, and closed the test gap. The difference is real rather than an oversight: WordBreakProperty.txt always has a code ; property shape, so a missing ; is corruption and fails loud. ExtendedPictographic.txt is a filtered single-property file (only Extended_Pictographic, with the property column stripped), so a line with no ; is the normal, well-formed case — the code points are taken whole. Forcing it to fail on a missing ; would reject valid data; I added a comment on ExtendedPictographic.parse spelling that out. For the test asymmetry: ExtendedPictographic.parse is now package-visible and has a malformed-data test (parseFailsLoudOnMalformedHex) asserting that a non-hex code-point column fails loud with IllegalArgumentException naming the resource — the same fail-loud contract WordBreakProperty already had.
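
To make the difference between the two file shapes concrete, here is a rough sketch of a strict parser and a lenient one. This is not the code in the PR: `UcdLineParser`, `parseStrict`, `parseLenient`, and the message wording are hypothetical, and both methods assume the caller has already skipped blank and comment-only lines.

```java
/** Hypothetical helpers illustrating the two line shapes; not the PR's loaders. */
final class UcdLineParser {

  private UcdLineParser() {}

  /** Drops a trailing '#' comment and surrounding whitespace. */
  static String stripComment(String line) {
    int hash = line.indexOf('#');
    return (hash >= 0 ? line.substring(0, hash) : line).trim();
  }

  /** "code ; property" shape: a missing ';' is treated as corruption. */
  static int[] parseStrict(String line, String resource) {
    String body = stripComment(line);
    int semi = body.indexOf(';');
    if (semi < 0) {
      throw new IllegalStateException(resource + ": missing ';' in line: " + line);
    }
    return parseRange(body.substring(0, semi).trim(), line, resource);
  }

  /** Filtered single-property shape: with no ';', the whole line is the code point column. */
  static int[] parseLenient(String line, String resource) {
    String body = stripComment(line);
    int semi = body.indexOf(';');
    String codes = semi < 0 ? body : body.substring(0, semi).trim();
    return parseRange(codes, line, resource);
  }

  /** Parses "XXXX" or "XXXX..YYYY" (hex) into {start, end}; fails loud on bad hex. */
  static int[] parseRange(String codes, String line, String resource) {
    try {
      int dots = codes.indexOf("..");
      if (dots < 0) {
        int cp = Integer.parseInt(codes, 16);
        return new int[] {cp, cp};
      }
      int start = Integer.parseInt(codes.substring(0, dots), 16);
      int end = Integer.parseInt(codes.substring(dots + 2), 16);
      return new int[] {start, end};
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          resource + ": malformed code point in line: " + line, e);
    }
  }
}
```

With these helpers, `parseLenient("1F600..1F64F", ...)` accepts a bare range, while `parseStrict` rejects the same line because the property column is missing; both reject a non-hex code point column.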

WordType.of leading-script heuristic. Added a note on WordType.of: the script category is taken from the first script code point in the range; UAX #29 word segments are single-script in practice, so for an unusual mixed-script run this reports the leading script rather than a per-character determination.
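
As an illustration of what a leading-script rule looks like, here is a small sketch built on the JDK's `Character.UnicodeScript`. It is not `WordType.of`: the class `LeadingScript` is hypothetical, and skipping COMMON, INHERITED, and UNKNOWN code points is my assumption about what "first script code point" means.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper showing a leading-script heuristic; not the PR's WordType. */
final class LeadingScript {

  private LeadingScript() {}

  /**
   * Returns the script of the first code point in [start, end) that carries a
   * real script. Later code points are never consulted, so a mixed-script run
   * reports only its leading script.
   */
  static UnicodeScript of(CharSequence text, int start, int end) {
    int i = start;
    while (i < end) {
      int cp = Character.codePointAt(text, i);
      UnicodeScript script = UnicodeScript.of(cp);
      // Assumption: digits, punctuation, and combining marks do not decide the script.
      if (script != UnicodeScript.COMMON
          && script != UnicodeScript.INHERITED
          && script != UnicodeScript.UNKNOWN) {
        return script;
      }
      i += Character.charCount(cp);
    }
    return UnicodeScript.COMMON; // no script-bearing code point found
  }
}
```

For a run that starts with Latin letters and continues in Cyrillic, this returns LATIN and ignores the rest, which is the trade-off the note describes.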

@rzo1 rzo1 requested review from atarora, jzonthemtn and mawiesne June 26, 2026 13:08
krickert added 3 commits June 27, 2026 08:18
…rdType (2a)

Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b),
and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant
WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic
data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the
official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on
the alignment layer in 1b.
WordBreakProperty.parse threw an opaque StringIndexOutOfBoundsException on a non-comment line with
no ';' (substring(0, -1)), unlike the sibling ExtendedPictographic.parse which guards it. It now
throws IllegalStateException naming the offending line. Exposed parse() package-visibly;
WordBreakPropertyTest proves the red->green. Real Word_Break data still loads (conformance suite).
…; WordType heuristic note (tokenizer)

ExtendedPictographic.parse now fails loud (IllegalArgumentException naming the line) on malformed hex,
matching the sibling loaders, with a comment explaining why its value column is optional (unlike
WordBreakProperty); added a malformed-data test. Noted in WordType.of's comment that the script
category comes from the first script code point (single-script UAX #29 segments; leading-script for
mixed runs).
@krickert krickert force-pushed the OPENNLP-1850-1b-alignment branch from 702acc5 to 2bed555 Compare June 27, 2026 12:29
@krickert krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from f2d1d8c to 9c8e3fc Compare June 27, 2026 12:29

2 participants