feat: English ITN with full rule coverage #358
Merged
- Add Money rule: two dollars => $2, one cent => $0.01
- Fix Time: require suffix for hour+minute, zero-pad hours, restrict to valid hour range (0-23) to avoid date conflicts
- Fix Decimal: add quantity support (five point two million => 5.2 million)
- Fix Money cents: pad single-digit cents (1 => 01)
- Extend _num_to_word to support 60-99

NeMo English ITN: 372/470 (79%)
All 1442 unit tests pass.
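The padding and range rules in this commit can be illustrated with a minimal plain-Python sketch. It is not the FST code from the PR: the helper names `format_cents` and `format_hour_minute` and the `HH:MM` output shape are assumptions made only for illustration.

```python
def format_cents(cents: int) -> str:
    """Render an amount below one dollar, padding single-digit cents (1 -> 01)."""
    if not 0 <= cents <= 99:
        raise ValueError("cents must be in 0..99")
    return f"$0.{cents:02d}"


def format_hour_minute(hour: int, minute: int) -> str:
    """Zero-pad the hour and reject hours outside 0-23 (hypothetical output shape)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    return f"{hour:02d}:{minute:02d}"


print(format_cents(1))            # one cent => $0.01
print(format_hour_minute(7, 30))  # 07:30
```

Rejecting hours above 23 is what keeps a spoken pair such as "twenty five ten" from being read as a time, which is presumably the date conflict the commit message refers to.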
- decimal: add cardinal+quantity support (63/63 full pass)
- time: add no-suffix hour+minute, quarter/half to, timezone (28/29)
- money: add cents padding, quantity, decimal format (43/52)
- measure: add compound units mph, sq ft, kgf/cm² (112/112 full pass)
- word: support apostrophes and trailing punctuation (54/55)
- cardinal: add 0-12 exception (consistent with NeMo)
- Fix token_parser ITN_ORDERS for time zone and money quantity
ead71ee to f1b10b9
- Rewrite Electronic rule: require 'at' for email or dot-separated domain, preventing false matches on plain text
- Add compound units to measurements.tsv (mph, sq ft, kgf/cm²)

NeMo coverage: 436/470 (93%)
Full pass: decimal(63), measure(112), ordinal(34)
- Money: add with_hundred pattern (one fifty five => $155), exclude thousand from quantity, fix fifteen thousand dollars => $15000
- Telephone: add double digit support in IP addresses
- Update test cases to match improved coverage (450 cases)
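The with_hundred reading (a single digit word followed by a two-digit number, as in one fifty five => $155) can be sketched in plain Python. This is a hedged illustration, not the PR's grammar: the lexicons and the helper names `two_digit` and `with_hundred` are hypothetical, and the sketch takes only the number words, without the currency word.

```python
ONES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
        "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def two_digit(words):
    """Parse 10-99 from one or two words; return None if the words do not fit."""
    if len(words) == 1:
        return TEENS.get(words[0]) or TENS.get(words[0])
    if len(words) == 2 and words[0] in TENS and words[1] in ONES:
        return TENS[words[0]] + ONES[words[1]]
    return None


def with_hundred(phrase):
    """Read '<digit> <two-digit>' as hundreds plus remainder, e.g. 1 and 55 -> 155."""
    words = phrase.split()
    if len(words) < 2 or words[0] not in ONES:
        return None
    rest = two_digit(words[1:])
    if rest is None:
        return None
    return f"${ONES[words[0]] * 100 + rest}"


print(with_hundred("one fifty five"))  # $155
```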
- Date: add decades pattern (nineteen eighties => 1980s)
- Telephone: increase serial weight to reduce false matches
- Telephone: add double digit support in IP
- Update test cases (451 cases)
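As a rough illustration of the decades pattern, the following plain-Python sketch maps a century word plus a plural decade word to its written form. The `CENTURIES` and `DECADES` tables and the `decade_to_written` helper are assumptions for this example; the PR itself implements the rule as a grammar, which is not shown here.

```python
# Hypothetical lexicons covering only a few centuries and the plural decade words.
CENTURIES = {"seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20}
DECADES = {"twenties": 20, "thirties": 30, "forties": 40, "fifties": 50,
           "sixties": 60, "seventies": 70, "eighties": 80, "nineties": 90}


def decade_to_written(spoken):
    """Return e.g. '1980s' for 'nineteen eighties', or None if the phrase does not match."""
    parts = spoken.split()
    if len(parts) != 2:
        return None
    century = CENTURIES.get(parts[0])
    decade = DECADES.get(parts[1])
    if century is None or decade is None:
        return None
    return f"{century}{decade}s"


print(decade_to_written("nineteen eighties"))  # 1980s
```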
Replace tagger.star with NeMo-style token + closure(delete_extra_space + token) pattern. This ensures explicit space consumption between tokens, resolving many segmentation ambiguities:
- seven eleven stores => 7-eleven stores (whitelist now wins)
- set alarm at ten to eleven pm => set alarm at 10:50 p.m.
- Time: fix minute_to composition (use raw digits without zero-padding) => time now 29/29 full pass
- Telephone: fix IP to support single+two_digit combinations (one twenty three dot... => 123.123.0.40)
- Cardinal: expose graph_two_digit for telephone serial
- cardinal: fix zero in exception list
- date: add Q2 quarter, 750BC, 3-digit year, decades => 36/36 full pass
- time: fix date vs time priority => 29/29 full pass
- whitelist: fixed via date priority => 12/12 full pass
- telephone: fix serial two_digit weight, IP combinations
- 7 full-pass rules: ordinal, decimal, measure, date, time, whitelist, money(51/52)
- electronic: exclude "dot" as email username first token
- money: reject singular "one" with plural currency ("one dollars")
- telephone: add credit card 4-6-4/4-6-5 formats with optional country code
- telephone: exclude "a" as serial first char to avoid "a thirty six" -> "a36"
- punctuation: add Punctuation class, split punct from words ("twenty!" -> "20 !")
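The punctuation split in the last bullet ("twenty!" -> "20 !") can be sketched as follows. This is a plain-Python approximation under stated assumptions: `NUMBER_WORDS` is a one-entry stand-in for the cardinal grammar, and `split_trailing_punct` and `normalize_token` are hypothetical helpers, not the Punctuation class added by the PR.

```python
import string

# Hypothetical one-entry lexicon standing in for the cardinal rule.
NUMBER_WORDS = {"twenty": "20"}


def split_trailing_punct(token):
    """Split a token into its core and any run of trailing ASCII punctuation."""
    core = token.rstrip(string.punctuation)
    return core, token[len(core):]


def normalize_token(token):
    """Normalize the core word, then re-attach punctuation as a separate token."""
    core, punct = split_trailing_punct(token)
    written = NUMBER_WORDS.get(core, core)
    return " ".join(part for part in (written, punct) if part)


print(normalize_token("twenty!"))  # 20 !
```

Only trailing punctuation is stripped, so a word-internal apostrophe such as the one in "don't" stays attached to the word.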
4ad2a70 to e242041
Summary
Complete English ITN implementation with full NeMo rule coverage.
NeMo test coverage: 469/470 (99.8%)
Rules implemented:
Remaining 1 failure:
Test plan