OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) by krickert · Pull Request #1110 · apache/opennlp

krickert · 2026-06-23T15:18:40Z

Part 2a of the OPENNLP-1850 stack. Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), as requested in review.

Self-contained: the Unicode-conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data, the official WordBreakTest.txt conformance suite (1944/1944), and the Unicode data LICENSE/NOTICE/rat-excludes.
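For readers unfamiliar with the conformance data: each non-comment line of the Unicode WordBreakTest.txt file alternates boundary markers (÷ for "break here", × for "no break") with hex code points. The following is a minimal sketch of reading one such line into a string plus its expected boundary offsets. The class `BreakTestLine` is hypothetical and is not the PR's WordBoundaryConformanceTest.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the PR. */
final class BreakTestLine {

  final String text;
  /** UTF-16 offsets into text at which a word boundary is expected. */
  final List<Integer> boundaries;

  private BreakTestLine(String text, List<Integer> boundaries) {
    this.text = text;
    this.boundaries = boundaries;
  }

  /** Parses one data line; returns null for blank or comment-only lines. */
  static BreakTestLine parse(String raw) {
    int hash = raw.indexOf('#');
    String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
    if (line.isEmpty()) {
      return null;
    }
    StringBuilder sb = new StringBuilder();
    List<Integer> breaks = new ArrayList<>();
    for (String field : line.split("\\s+")) {
      if (field.equals("\u00F7")) {          // U+00F7 DIVISION SIGN: boundary expected
        breaks.add(sb.length());
      } else if (!field.equals("\u00D7")) {  // U+00D7 MULTIPLICATION SIGN: no boundary
        sb.appendCodePoint(Integer.parseInt(field, 16));
      }
    }
    return new BreakTestLine(sb.toString(), breaks);
  }
}
```

A conformance check would then run the segmenter over `text` and compare the boundaries it reports with `boundaries`.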

WordBreakProperty and ExtendedPictographic load their data lazily and recoverably (double-checked accessor, no classpath resource I/O in a static {} block), per the same review point as on the foundation — so a resource the loader cannot see is a catchable exception at call time, not a class-poisoning ExceptionInInitializerError.

Base: OPENNLP-1850-1b-alignment (#1109). Stack: 1a → 1b → 2a (this) → 2b → 2c → DL → docs.

krickert · 2026-06-25T11:20:33Z

@rzo1 Both points on the tokenizer PR are addressed.

Resource loading (done). WordBreakProperty and ExtendedPictographic no longer load in a static {} block — each now loads lazily on first use via a double-checked accessor, so a resource the loader can't see is a catchable exception at call time rather than an ExceptionInInitializerError that poisons the class (and would otherwise take the whole WordSegmenter/WordTokenizer down). The getResourceAsStream in WordBoundaryConformanceTest is left as-is (test-only).
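The lazy-loading pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the class name `LazyResource`, the resource path, and the choice of IllegalStateException for a missing resource are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for illustration; not part of the PR. */
final class LazyResource {

  // Assumed path, deliberately one that does not exist.
  private static final String RESOURCE = "/hypothetical/missing-data.txt";

  // volatile makes the double-checked read safe under the Java memory model.
  private static volatile String data;

  private LazyResource() {
  }

  /** Loads on first use; a failed load leaves the class usable for a retry. */
  static String data() {
    String local = data;
    if (local == null) {
      synchronized (LazyResource.class) {
        local = data;
        if (local == null) {
          local = load(RESOURCE);
          data = local; // published only after a successful load
        }
      }
    }
    return local;
  }

  private static String load(String resource) {
    try (InputStream in = LazyResource.class.getResourceAsStream(resource)) {
      if (in == null) {
        throw new IllegalStateException("Resource not found on classpath: " + resource);
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to read " + resource, e);
    }
  }
}
```

Because nothing runs in a static initializer, a caller can catch the exception and call `data()` again later. Had the same load failed inside a `static {}` block, the first access would throw ExceptionInInitializerError and every later access NoClassDefFoundError.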

Split into 2a / 2b / 2c (done). Along the three concepts you identified:

OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) #1110 — UAX #29 tokenizer (WordSegmenter/WordTokenizer/WordType/WordToken/WordBreak/WordBreakProperty/ExtendedPictographic + bundled data + the conformance suite). Self-contained, the bulk of the work.
OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7) #1111 — Term model (Term/TermAnalyzer), on 2a.
OPENNLP-1850: Per-language NormalizationProfile registry (2c/7) #1112 — NormalizationProfile registry, on 2b.

#1105 (DL) now bases on #1112; I closed #1104 pointing here. Each layer builds and tests green on its own.

rzo1 · 2026-06-25T12:08:10Z

Loader symmetry: WordBreakProperty.parse fails loud on a missing ;, while ExtendedPictographic.parse silently treats a line with no ; as whole-content, and only the former has a malformed-data/missing-resource test. Worth making the two read and test alike.
WordType.of fixes a token's script from its first script code point only; fine for single-script UAX#29 segments but a heuristic for mixed-script runs. A one-line javadoc note would help.

krickert · 2026-06-25T18:16:04Z

@rzo1 Both addressed (tip f2d1d8cc).

Loader symmetry. I kept the two loaders deliberately different but documented why, and closed the test gap. The difference is real rather than an oversight: WordBreakProperty.txt always has a code ; property shape, so a missing ; is corruption and fails loud. ExtendedPictographic.txt is a filtered single-property file (only Extended_Pictographic, with the property column stripped), so a line with no ; is the normal, well-formed case — the code points are taken whole. Forcing it to fail on a missing ; would reject valid data; I added a comment on ExtendedPictographic.parse spelling that out. For the test asymmetry: ExtendedPictographic.parse is now package-visible and has a malformed-data test (parseFailsLoudOnMalformedHex) asserting that a non-hex code-point column fails loud with IllegalArgumentException naming the resource — the same fail-loud contract WordBreakProperty already had.
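A rough sketch of the contract described for the filtered single-property file: the `;` column is optional, but a non-hex code-point column fails loud with an IllegalArgumentException that names the resource. The class `OptionalColumnParser`, its signature, and the range representation are assumptions for illustration and do not reproduce ExtendedPictographic.parse.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for illustration; not part of the PR. */
final class OptionalColumnParser {

  private OptionalColumnParser() {
  }

  /** Returns inclusive {start, end} code point ranges, one per data line. */
  static List<int[]> parse(String resourceName, List<String> lines) {
    List<int[]> ranges = new ArrayList<>();
    for (String raw : lines) {
      int hash = raw.indexOf('#');
      String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
      if (line.isEmpty()) {
        continue; // blank or comment-only line
      }
      // A missing ';' is well-formed here: the whole line is the code point column.
      int semi = line.indexOf(';');
      String codePoints = (semi >= 0 ? line.substring(0, semi) : line).trim();
      try {
        int dots = codePoints.indexOf("..");
        int start = Integer.parseInt(dots >= 0 ? codePoints.substring(0, dots) : codePoints, 16);
        int end = dots >= 0 ? Integer.parseInt(codePoints.substring(dots + 2), 16) : start;
        ranges.add(new int[] {start, end});
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException(
            "Malformed code point column '" + codePoints + "' in " + resourceName, e);
      }
    }
    return ranges;
  }
}
```

Under this sketch, `"1F600..1F64F"` and `"1F600..1F64F ; Extended_Pictographic"` yield the same range, while `"ZZZZ"` is rejected.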

WordType.of leading-script heuristic. Added a note on WordType.of: the script category is taken from the first script code point in the range; UAX #29 word segments are single-script in practice, so for an unusual mixed-script run this reports the leading script rather than a per-character determination.
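To make the leading-script behaviour concrete, here is a small sketch built on the JDK's `Character.UnicodeScript`. It assumes that "first script code point" means the first code point whose script is not COMMON, INHERITED, or UNKNOWN. That reading, and the class `LeadingScript`, are assumptions and not taken from WordType.of.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper for illustration; not part of the PR. */
final class LeadingScript {

  private LeadingScript() {
  }

  /** Returns the script of the first script-bearing code point in [start, end). */
  static UnicodeScript leadingScript(CharSequence text, int start, int end) {
    int i = start;
    while (i < end) {
      int cp = Character.codePointAt(text, i);
      UnicodeScript script = UnicodeScript.of(cp);
      if (script != UnicodeScript.COMMON
          && script != UnicodeScript.INHERITED
          && script != UnicodeScript.UNKNOWN) {
        return script; // later code points are never inspected
      }
      i += Character.charCount(cp);
    }
    return UnicodeScript.COMMON; // no script-bearing code point found
  }
}
```

For a run that starts with Latin letters and continues in Cyrillic, this reports LATIN. That is the trade-off the note documents: the result is the leading script, not a per-character determination.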

…rdType (2a) Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on the alignment layer in 1b.

WordBreakProperty.parse threw an opaque StringIndexOutOfBoundsException on a non-comment line with no ';' (substring(0, -1)), unlike the sibling ExtendedPictographic.parse which guards it. It now throws IllegalStateException naming the offending line. Exposed parse() package-visibly; WordBreakPropertyTest proves the red->green. Real Word_Break data still loads (conformance suite).
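The failure mode and the guard can be sketched as below. `indexOf` returns -1 when there is no `;`, so an unguarded `substring(0, -1)` throws StringIndexOutOfBoundsException. The class `RequiredColumnParser` and its method are hypothetical and do not reproduce WordBreakProperty.parse.

```java
/** Hypothetical parser for illustration; not part of the PR. */
final class RequiredColumnParser {

  private RequiredColumnParser() {
  }

  /**
   * Splits a "code ; property" line into {code, property}.
   * Returns null for blank or comment-only lines.
   */
  static String[] splitLine(String raw) {
    int hash = raw.indexOf('#');
    String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
    if (line.isEmpty()) {
      return null;
    }
    int semi = line.indexOf(';');
    if (semi < 0) {
      // Without this guard the next statement would call substring(0, -1).
      throw new IllegalStateException("Missing ';' in line: " + raw);
    }
    return new String[] {line.substring(0, semi).trim(), line.substring(semi + 1).trim()};
  }
}
```

The exception message carries the offending line, which is the diagnostic improvement the commit describes.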

…; WordType heuristic note (tokenizer) ExtendedPictographic.parse now fails loud (IllegalArgumentException naming the line) on malformed hex, matching the sibling loaders, with a comment explaining why its value column is optional (unlike WordBreakProperty); added a malformed-data test. Noted in WordType.of's comment that the script category comes from the first script code point (single-script UAX #29 segments; leading-script for mixed runs).

This was referenced Jun 23, 2026

OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7) #1111

Open

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Closed

krickert force-pushed the OPENNLP-1850-1b-alignment branch from 08de0d3 to 9af6d92 on June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from a450069 to dc02b9e on June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-1b-alignment branch from 9af6d92 to 9dc7d51 on June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dc02b9e to dd1906d on June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-1b-alignment branch from 9dc7d51 to b24c9ee on June 25, 2026 08:26

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dd1906d to 3fae8aa on June 25, 2026 08:26

krickert marked this pull request as ready for review on June 25, 2026 11:28

krickert force-pushed the OPENNLP-1850-1b-alignment branch from b24c9ee to 702acc5 on June 25, 2026 17:24

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from 3fae8aa to f2d1d8c on June 25, 2026 17:25

rzo1 requested review from atarora, jzonthemtn and mawiesne on June 26, 2026 13:08

krickert added 3 commits on June 27, 2026 08:18

krickert force-pushed the OPENNLP-1850-1b-alignment branch from 702acc5 to 2bed555 on June 27, 2026 12:29

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from f2d1d8c to 9c8e3fc on June 27, 2026 12:29

krickert wants to merge 3 commits into OPENNLP-1850-1b-alignment from OPENNLP-1850-2a-tokenizer
