Skip to content

mkbabb/parse-that

Repository files navigation

parse[that]

Parser combinators for TypeScript and Rust.

Write your grammar in BBNF (Better Backus-Naur Form) .

Handles left recursion and left factoring. Performance-focused with an emphasis on readability.

Usage

TypeScript:

import { string, regex } from "@mkbabb/parse-that";

const heyy = regex(/hey+t/);
heyy.parse("heyyyyyyyyyt"); // => "heyyyyyyyyyt"

Rust:

use parse_that::string;

let heyy = string("heyyyyyyyyyt");
heyy.parse("heyyyyyyyyyt"); // => "heyyyyyyyyyt"

Domain parsers ship with the library:

import { jsonParser, csvParser } from "@mkbabb/parse-that";

jsonParser().parse('{"key": [1, 2, 3]}');   // combinator-based
csvParser().parse('a,b,c\n1,2,3');          // RFC 4180

Or with a grammar (via bbnf-lang):

import { generateParserFromBBNF } from "@mkbabb/bbnf-lang";

const grammar = `
    expr = term, { ("+" | "-"), term };
    term = factor, { ("*" | "/"), factor };
    factor = number | "(", expr, ")";
    number = /[0-9]+/;
`;

const [nonterminals, ast] = generateParserFromBBNF(grammar);
const expr = nonterminals.expr;
expr.parse("1 + 2 * 3"); // => [1, "+", [2, "*", 3]]

Table of Contents

Structure

typescript/            TS library (@mkbabb/parse-that v0.8.2)
  src/parse/           Isomorphic module layout (see below)
  test/                Vitest tests + benchmark comparators
rust/                  Rust workspace (nightly, edition 2024)
  parse_that/          Core lib — isomorphic module layout (see below)
  src/                 CLI binary (parse_that_cli)
grammar/               Shared test vectors (BBNF grammars in bbnf-lang repo)
  tests/json/          Shared JSON test vectors
  tests/css/           CSS recovery test vectors
  tests/debug/         Shared diagnostic output vectors
docs/                  Perf chronicles, API reference

Both languages share a mirrored module structure:

Module TypeScript (src/parse/) Rust (parse_that/src/)
Core parser.ts — Parser<T> class parse.rs — Parser<'a, O> struct
Combinators methods on Parser class (incl. recover) combinators/ — methods + seq!/alt! macros
Leaf parsers leaf.ts — string, regex, dispatch leaf.rs — string, regex, dispatch_byte
Lazy eval lazy.ts — lazy(), getLazyParser lazy.rs — LazyParser, lazy()
Span / zero-copy span.ts — regexSpan, manySpan, altSpan, takeUntilAnySpan span_parser/ — SpanParser enum + methods
Balanced splitting split.ts — splitBalanced split.rs — split_balanced
State state.ts — ParserState, Span state.rs — ParserState, Span
Debug debug.ts — diagnostics, ANSI output debug.rs — diagnostics (feature-gated)
Domain parsers parsers/ — JSON, CSV, CSS parsers/ — JSON + scanners, CSV, CSS

Performance

All benchmarks run on Apple M-series (AArch64). JSON parsing with full DOM materialization and string escape decoding. Higher is better.

Rust

MB/s throughput. bencher crate with black_box on inputs and b.bytes set.

JSON matrix (cold per-parse, MB/s)

All BBNF numbers use BumpSlab with monolithic codegen (fresh slab + parser per iteration). Competitors construct per-iteration.

Parser data.json twitter citm_catalog canada data_xl
sonic-rs 2,323 2,515 3,037 1,521 2,646
simd-json 1,460 1,538 1,267 487 1,584
jiter 1,257 1,004 986 556 1,308
serde_json_borrow 1,165 1,292 1,268 617 1,196
BBNF span 1,250 1,372 1,627 1,112 850
BBNF borrow 1,019 1,156 1,257 598 632
BBNF copy 887 900 1,173 595 567
nom 576 496 607 391 601
serde_json 576 549 851 559 602
winnow 524 525 581 390 582
pest 255 222 250 154 249

parse_that uses SIMD string scanning (memchr2), integer fast path, Vec objects (no HashMap), Cow<str> zero-copy strings, and #[cold] escape decoding.

The BBNF AOT parser uses #[derive(Parser)] from a .bbnf grammar file—zero hand-written Rust. All benchmarks use mimalloc. The monolithic slab codegen emits direct recursive functions with BumpSlab allocation, dispatch-byte elimination, IIFE elision, and B.1 Span collapse.

See docs/perf-optimization-rust.md for the full optimization chronicle.

TypeScript

Relative to JSON.parse (native C++). Vitest bench, 5 iterations, 5s warmup.

Parser data (35KB) apache (124KB) twitter (617KB) citm (1.7MB) canada (2.1MB)
JSON.parse (native) 1.00x 1.00x 1.00x 1.00x 1.00x
parse-that 5.04x 4.95x 6.64x 6.64x 2.65x
Chevrotain 6.21x 6.98x
Peggy 23.3x 24.6x
Parsimmon 26.5x 29.9x
Nearley + moo 65.2x

Key optimizations: mutable ParserState (zero-alloc), Tarjan's SCC for minimal lazy wrappers, FIRST-set dispatch tables (O(1) alternation), regex test()+substring() (no RegExpMatchArray alloc), inline wrap().

See docs/perf-optimization-ts.md for the full optimization chronicle.

CSS

Rust MB/s on normalize.css (6KB), bootstrap.css (281KB), tailwind-output.css (3.6MB). BBNF uses cold BumpSlab per-parse.

Parser normalize bootstrap tailwind Level
BBNF pretty slab 659 276 284 L1.75 — typed AST for formatting
BBNF parse-only 639 939 469 L0 — parse phase isolated
cssparser 323 437 360 L0 — tokenizer, callbacks only
lightningcss 254 117 90 L2 — semantic

The pretty slab tier produces the full typed AST that gorgeous uses for CSS formatting. A delimiter-driven flat scanner in the monolithic codegen handles Wrap(Repeat(Alt)) patterns with overlapping FIRST sets—uses forward memchr to select the branch, then calls the existing recursive descent for typed output. Grammar-agnostic; all delimiter bytes extracted from the grammar's Literal nodes.

TypeScript (relative to parse-that):

Parser normalize (6KB) bootstrap (274KB)
parse-that (L1.75) 1.00x 1.00x
postcss (L1) 1.65x slower 1.34x slower
css-tree (L1-L2) 2.43x slower 2.13x slower

Debugging

The debug combinator pretty-prints parser state during execution.

image

As output, you'll see:

  • Header: parsing status (Ok/Err/Done), current offset, node name, stringified parser
  • Body: surrounding lines of input with the cursor at the active column, line-numbered

Color-coded: BBNF nonterminals in blue, stringified parsers in yellow.

Diagnostics

Both implementations ship a structured diagnostics system — Rust behind a diagnostics Cargo feature, TypeScript via enableDiagnostics() / disableDiagnostics(). Zero overhead when off.

When enabled, parsers accumulate at the furthest offset:

  • Expected sets — leaf parsers pre-compute labels at construction time ("hello", /[0-9]+/, one of ['a'-'z']). Alternation chains merge them into expected X, Y, or Z (Oxford comma).
  • Suggestionswrap() detects unclosed delimiters and emits help: unclosed '(' — insert matching ')'; EOF checks flag trailing content.
  • Secondary spans — point back to related source locations (unclosed '{' opened here) for multi-site error context.

Rich rendering: ANSI color output with TTY detection and NO_COLOR respect, center-truncation of long lines around the error column, ±4 lines of context with gutter line numbers. Shared test vectors in grammar/tests/debug/ ensure isomorphic output between TypeScript and Rust.

Error Recovery

The recover(sync, sentinel) combinator enables multi-error parsing — the standard technique used by production compilers (rustc, GCC, clang) to parse past failures and keep going.

const decl = declaration.trim().recover(declSync, "RECOVERED");
const block = decl.many(0).wrap("{", "}");

On failure, recover() snapshots the current diagnostic state into a Diagnostic object (expected sets, suggestions, secondary spans, source location), pushes it to a collection, then invokes the sync parser to skip forward to a known recovery point (;, }, etc.) and returns the sentinel. many() loops keep going — each failed element produces a diagnostic but doesn't halt the overall parse.

clearCollectedDiagnostics();
stylesheet.parse(cssWithErrors);
const diagnostics = getCollectedDiagnostics();
console.error(formatAllDiagnostics(diagnostics, css));

Both TypeScript and Rust expose the same API: collectDiagnostic() / push_diagnostic(), getCollectedDiagnostics() / get_collected_diagnostics(), formatDiagnostic() / format_diagnostic(). Rust uses thread-local storage; TypeScript uses module-level globals. See grammar/tests/css/complex-errors.css for a multi-error test vector.

BBNF and the Great Parser Generator

Better Backus-Naur Form: a readable, practical grammar notation. An extension of EBNF with skip/next operators, regex terminals, mapping functions, and an @import system.

The BBNF grammar for BBNF itself lives in the bbnf-lang repo.

With your grammar in hand, call generateParserFromBBNF (TypeScript) or use #[derive(Parser)] (Rust):

const [nonterminals, ast] = generateParserFromBBNF(grammar);
#[derive(Parser)]
#[parser(path = "grammar/json.bbnf")]
pub struct Json;

let result = Json::value().parse(input);

Each nonterminal is a Parser object. The BBNF parser-generator is itself written in BBNF—self-hosting. The BBNF ecosystem (compiler, LSP, VS Code extension) lives in the separate bbnf-lang repo.

Operators

Syntax Meaning
A? / [ A ] Optional A
A* / { A } Repeated A (0 or more)
A+ Repeated A (1 or more)
A | B A or B (higher precedence than ,)
A, B A followed by B
A - B A, but not B
A >> B A then B, return B only
A << B A then B, return A only
( A ) Grouping

Emojis supported. Epsilon has a special value: ε.

Left recursion & more

Direct and indirect left recursion are fully supported, as are highly ambiguous grammars:

expr = expr , "+" , expr
     | integer
     | string ;

Using BBNF

The BBNF compiler optimizes the grammar automatically:

  1. Topological sort via Tarjan's SCC
  2. Remove indirect left recursion
  3. Remove direct left recursion
  4. Left factorize

Combinator support

Left recursion via memoize and mergeMemos:

const expression = Parser.lazy(() =>
    all(expression, operators.then(expression).opt()).mergeMemos().or(number)
).memoize();

See memoize.test.ts for details.

Caveats

Left recursion works but isn't optimal. If it can be factored out via BBNF, performance will be fine; otherwise expect slowdowns, since JavaScript lacks proper tail call optimization.

API & examples

See api.md for API information.

See the TypeScript tests and Rust tests for working examples.

Sources, acknowledgements, & c.

Theory

Notation

Performance

  • Eisel, D. & Lemire, D. (2021). Number parsing at a gigabyte per second. — fast_float2 crate for Rust JSON float parsing.
  • memchr — SIMD-accelerated byte scanning. Used for memchr2-based JSON string scanning on AArch64 (NEON) and x86 (SSE2/AVX2).
  • rustc_ast/format.rsRust compiler source. Reference for u32 keyword matching pattern.

Parser libraries (competitors and influences)

  • Parsimmon — Monadic TS parser combinators. Benchmarked as baseline.
  • Chevrotain — TS parser toolkit with CST. Closest TS competitor.
  • nom — Rust parser combinators. The Rust ecosystem standard.
  • winnow — nom's successor with improved dispatch.
  • pest — PEG parser for Rust. Grammar-driven.
  • sonic-rs — ByteDance's SIMD JSON parser. The ceiling in our benchmarks.
  • simd-json — Rust port of simdjson's 3-phase architecture.
  • jiter — Pydantic's scalar JSON parser. Closest peer to our fast path.

About

Parser combinators for TypeScript - with BBNF (Better Backus–Naur Form).

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages