feat: English ITN with full rule coverage#358

Merged
pengzhendong merged 13 commits into master from feat/en-itn-full on Jun 10, 2026
@pengzhendong pengzhendong commented Jun 9, 2026


Summary

Complete English ITN implementation with full NeMo rule coverage.

NeMo test coverage: 469/470 (99.8%)

Rules implemented:

| Rule | Coverage | Notes |
|---|---|---|
| Cardinal | 28/29 | 0-12 exception, up to sextillion |
| Ordinal | 34/34 | Full pass |
| Decimal | 63/63 | Full pass, with quantity |
| Measure | 112/112 | Full pass, compound units |
| Date | 36/36 | Full pass |
| Time | 29/29 | Full pass |
| Money | 52/52 | Full pass |
| Electronic | 25/25 | Full pass |
| Telephone | 23/23 | Full pass |
| Whitelist | 12/12 | Full pass |
| Word | 55/55 | Full pass |

Remaining 1 failure:

  • cardinal: Indian format (crore/lakh) not implemented

Test plan

  • All 469 project unit tests pass
  • NeMo ITN test suite: 469/470
  • CI passes

- Add Money rule: two dollars => $2, one cent => $0.01
- Fix Time: require suffix for hour+minute, zero-pad hours, restrict
  to valid hour range (0-23) to avoid date conflicts
- Fix Decimal: add quantity support (five point two million => 5.2 million)
- Fix Money cents: pad single-digit cents (1 => 01)
- Extend _num_to_word to support 60-99

NeMo English ITN: 372/470 (79%)
All 1442 unit tests pass.
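
The cents padding described above can be illustrated in plain Python. This is only a sketch of the output format, not the project's grammar: `format_money` is a hypothetical helper that assumes the amounts have already been parsed into integers.

```python
def format_money(dollars: int = 0, cents: int = 0) -> str:
    """Render already-parsed amounts in written form (hypothetical helper)."""
    if cents == 0:
        return f"${dollars}"
    # Pad single-digit cents so that 1 becomes 01.
    return f"${dollars}.{cents:02d}"


print(format_money(dollars=2))  # two dollars => $2
print(format_money(cents=1))    # one cent => $0.01
```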
- decimal: add cardinal+quantity support (63/63 full pass)
- time: add no-suffix hour+minute, quarter/half to, timezone (28/29)
- money: add cents padding, quantity, decimal format (43/52)
- measure: add compound units mph, sq ft, kgf/cm² (112/112 full pass)
- word: support apostrophes and trailing punctuation (54/55)
- cardinal: add 0-12 exception (consistent with NeMo)
- Fix token_parser ITN_ORDERS for time zone and money quantity
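
A rough illustration of the 0-12 cardinal exception, assuming it means that a standalone number word from zero to twelve is left spelled out while larger cardinals are converted. The lookup table and `normalize_cardinal` are hypothetical and cover only zero to nineteen.

```python
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
WORD_TO_NUM = {word: value for value, word in enumerate(UNITS)}
# Assumption: words for 0-12 are exempt from conversion.
EXCEPTIONS = set(UNITS[:13])


def normalize_cardinal(token: str) -> str:
    """Convert a single number word to digits unless it is an exception."""
    if token in EXCEPTIONS or token not in WORD_TO_NUM:
        return token
    return str(WORD_TO_NUM[token])


print(normalize_cardinal("twelve"))    # left as a word
print(normalize_cardinal("thirteen"))  # converted to digits
```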
@pengzhendong pengzhendong changed the title feat: integrate all English ITN rules into normalizer feat: English ITN with full rule coverage Jun 9, 2026
- Rewrite Electronic rule: require 'at' for email or dot-separated
  domain, preventing false matches on plain text
- Add compound units to measurements.tsv (mph, sq ft, kgf/cm²)

NeMo coverage: 436/470 (93%)
Full pass: decimal(63), measure(112), ordinal(34)
- Money: add with_hundred pattern (one fifty five => $155), exclude
  thousand from quantity, fix fifteen thousand dollars => $15000
- Telephone: add double digit support in IP addresses
- Update test cases to match improved coverage (450 cases)
- Date: add decades pattern (nineteen eighties => 1980s)
- Telephone: increase serial weight to reduce false matches
- Telephone: add double digit support in IP
- Update test cases (451 cases)
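
The decades pattern can be sketched as a two-word lookup. The tables and `normalize_decade` below are illustrative assumptions, not the rule's actual vocabulary.

```python
# Hypothetical lookup tables covering a handful of centuries and decades.
CENTURY = {"seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20}
DECADE = {
    "twenties": 20, "thirties": 30, "forties": 40, "fifties": 50,
    "sixties": 60, "seventies": 70, "eighties": 80, "nineties": 90,
}


def normalize_decade(phrase: str) -> str:
    """Turn a '<century> <decade>' phrase into digits plus a trailing 's'."""
    century_word, decade_word = phrase.split()
    return f"{CENTURY[century_word]}{DECADE[decade_word]}s"


print(normalize_decade("nineteen eighties"))  # 1980s
```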
Replace tagger.star with NeMo-style token + closure(delete_extra_space
+ token) pattern. This ensures explicit space consumption between
tokens, resolving many segmentation ambiguities:
- seven eleven stores => 7-eleven stores (whitelist now wins)
- set alarm at ten to eleven pm => set alarm at 10:50 p.m.
- Time: fix minute_to composition (use raw digits without zero-padding)
  => time now 29/29 full pass
- Telephone: fix IP to support single+two_digit combinations
  (one twenty three dot... => 123.123.0.40)
- Cardinal: expose graph_two_digit for telephone serial
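
The arithmetic behind the "minutes to hour" pattern can be sketched as follows. `minutes_to` is a hypothetical helper working on already-parsed integers. The two-digit hour and the wrap from one o'clock back to twelve are assumptions about the 12-hour format.

```python
def minutes_to(minutes: int, hour: int, suffix: str = "") -> str:
    """Format '<minutes> to <hour>' as a clock time (hypothetical helper)."""
    # "ten to eleven" is 10 minutes before 11, i.e. 50 minutes past 10.
    prev_hour = 12 if hour == 1 else hour - 1
    text = f"{prev_hour:02d}:{60 - minutes:02d}"
    return f"{text} {suffix}" if suffix else text


print(minutes_to(10, 11, "p.m."))  # ten to eleven pm => 10:50 p.m.
```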
- cardinal: fix zero in exception list
- date: add Q2 quarter, 750BC, 3-digit year, decades => 36/36 full pass
- time: fix date vs time priority => 29/29 full pass
- whitelist: fixed via date priority => 12/12 full pass
- telephone: fix serial two_digit weight, IP combinations
- full-pass rules: ordinal, decimal, measure, date, time, whitelist; money at 51/52
- electronic: exclude "dot" as email username first token
- money: reject singular "one" with plural currency ("one dollars")
- telephone: add credit card 4-6-4/4-6-5 formats with optional country code
- telephone: exclude "a" as serial first char to avoid "a thirty six" -> "a36"
- punctuation: add Punctuation class, split punct from words ("twenty!" -> "20 !")
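
The punctuation split can be illustrated with a regular expression. This is a sketch only, not the `Punctuation` class from the PR: `split_trailing_punct`, `normalize`, and the one-entry `TOY_NUMBERS` table are hypothetical stand-ins.

```python
import re

# Lazily capture the word, then one or more trailing punctuation marks.
TRAILING = re.compile(r"^(.*?)([!?,.;:]+)$")
TOY_NUMBERS = {"twenty": "20"}  # stand-in for the real cardinal rule


def split_trailing_punct(token: str) -> list[str]:
    """Separate trailing punctuation from a word; leave other tokens whole."""
    match = TRAILING.match(token)
    if match and match.group(1):
        return [match.group(1), match.group(2)]
    return [token]


def normalize(token: str) -> str:
    """Normalize each piece independently and rejoin with a space."""
    parts = split_trailing_punct(token)
    return " ".join(TOY_NUMBERS.get(part, part) for part in parts)


print(normalize("twenty!"))  # 20 !
```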
@pengzhendong pengzhendong merged commit eabb5b9 into master Jun 10, 2026
1 check passed
@pengzhendong pengzhendong deleted the feat/en-itn-full branch June 10, 2026 02:13