hasStructuralKeyword: STRUCTURAL_STEMS_RE has no right-side boundary, false-fires on ordinary words (callus, Connecticut, affectionate, Tracey) #1138

Description

@inth3shadows

Summary

Four of the English stems in the multilingual structural-question gate (#1134) — call, trace,
affect, connect — match as unbounded prefixes, so ordinary non-technical English words that merely
start with one of these stems false-fire the gate's HIGH-tier (full-explore) branch.

Root cause

// src/directory.ts:405-406
const STRUCTURAL_STEMS_RE = new RegExp(`${NOT_WORD_BEFORE}(?:${STRUCTURAL_STEMS.join('|')})`, 'iu');

STRUCTURAL_STEMS_RE enforces a LEFT boundary only (by design — see the docstring at
src/directory.ts:312-320 — so "architect" matches "architecture"). Most stems in the list only ever
complete into structural words, so the open right side is safe for them. But call, trace,
affect, connect have common English completions that aren't structural at all: callus,
calligraphy, callous, Connecticut, connective (tissue), affectionate, Tracey. The file's own review
rule for this list ("Add a stem only when every plausible completion is still a structural word",
src/directory.ts:318) doesn't hold for these four.
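
For readers without the repo checked out, here is a minimal self-contained sketch of the left-boundary-only behaviour described above. The `NOT_WORD_BEFORE` definition and the shortened stem list are assumed stand-ins, not copies of what src/directory.ts actually contains, so treat this as an illustration of the mechanism rather than the real gate.

```typescript
// ASSUMPTION: stand-in for the real NOT_WORD_BEFORE in src/directory.ts,
// which may be defined differently. This one rejects a match when the
// preceding character is a letter, digit, or underscore.
const NOT_WORD_BEFORE = '(?<![\\p{L}\\p{N}_])';

// ASSUMPTION: a shortened, illustrative stem list, not the full STRUCTURAL_STEMS.
const STEMS = ['architect', 'structur', 'call', 'trace', 'affect', 'connect'];

// Same shape as the regex quoted in the issue: left boundary, open right side.
const leftBoundOnly = new RegExp(`${NOT_WORD_BEFORE}(?:${STEMS.join('|')})`, 'iu');

const samples = [
  'the architecture',           // intended hit: stem completes into a structural word
  'restructure this paragraph', // blocked: "structur" is preceded by a letter
  'he has a callus',            // unintended hit: "call" + "us"
  'Connecticut is a state',     // unintended hit: "connect" + "icut"
];

for (const s of samples) {
  console.log(`${JSON.stringify(s)} -> ${leftBoundOnly.test(s)}`);
}
```

Under these assumptions the left boundary correctly rejects "restructure", but nothing stops "callus" or "Connecticut", because the pattern ends as soon as the stem itself has matched.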

Repro (verified by direct execution against the built regex, node 22, node:sqlite backend)

import { hasStructuralKeyword } from './src/directory';

hasStructuralKeyword('he has a callus');            // true — should be false
hasStructuralKeyword('Connecticut is a state');      // true — should be false
hasStructuralKeyword('she is very affectionate');    // true — should be false
hasStructuralKeyword('Tracey went home early');       // true — should be false

I have NOT run the full codegraph prompt-hook CLI end-to-end on these strings; the above is a
direct unit-level call. Reading the call site confirms that keyworded = hasStructuralKeyword(prompt)
at src/bin/codegraph.ts:1094 feeds straight into the HIGH-tier branch, which runs codegraph_explore
and writes a <codegraph_context> block. PR #1134's own reported payload sizes put HIGH-tier
injections around 16KB on this repo for other prompts, so I'd expect a comparable cost here, though
I haven't measured it for these exact strings.

No existing test catches this — __tests__/frontload-hook.test.ts's mid-word guards
("restructure this paragraph", "an independent module", lines 230-231) only exercise the LEFT
boundary.

Impact

Any prompt containing a word that merely begins with one of these four stems pays the same
unnecessary explore-and-inject cost that the STRUCTURAL_WORDS exact-match class exists to avoid
(per its own docstring: "short or ambiguous tokens where prefix matching would false-positive").
"call" in particular is a common enough word prefix that this seems likely to fire in normal use,
though I don't have production telemetry to say how often.

Suggested fix

Bound the four risky stems to their known derivational suffixes, matching the suffix-enumeration
pattern STRUCTURAL_WORDS already uses elsewhere in this same file (e.g. reach(?:es|ed)?):

`call(?:s|ing|ed|ers?)?${NOT_WORD_AFTER}`
`trace(?:s|d|rs?)?${NOT_WORD_AFTER}`
`affect(?:s|ed|ing)?${NOT_WORD_AFTER}`
`connect(?:s|ed|ing|ions?|ors?)?${NOT_WORD_AFTER}`

Verified by direct execution: calls/calling/called/caller(s), traces/traced/tracer(s),
affects/affected/affecting, connects/connected/connecting/connection(s)/connector(s) all still
match; callus/calligraphy/callous/Connecticut/connective/affectionate/Tracey no longer do. All 31
existing tests in frontload-hook.test.ts still pass with this change. The other 6 English stems
(architect, structur, depend, implement, impact, explain) didn't turn up demonstrable non-technical
collisions in my check, so I'd leave those as open prefixes rather than touch what isn't
demonstrated broken.
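
As a rough sketch of how the four bounded patterns above could sit alongside the six open stems in a single regex: both boundary definitions and the way the lists are combined are assumptions made for illustration, and the real constants in src/directory.ts may differ.

```typescript
// ASSUMPTION: stand-ins for the real boundary constants in src/directory.ts.
const NOT_WORD_BEFORE = '(?<![\\p{L}\\p{N}_])';
const NOT_WORD_AFTER = '(?![\\p{L}\\p{N}_])';

// The six stems the issue proposes leaving as open prefixes.
const OPEN_STEMS = ['architect', 'structur', 'depend', 'implement', 'impact', 'explain'];

// The four risky stems, bounded to the suffixes listed in the suggested fix.
const BOUNDED_STEMS = [
  `call(?:s|ing|ed|ers?)?${NOT_WORD_AFTER}`,
  `trace(?:s|d|rs?)?${NOT_WORD_AFTER}`,
  `affect(?:s|ed|ing)?${NOT_WORD_AFTER}`,
  `connect(?:s|ed|ing|ions?|ors?)?${NOT_WORD_AFTER}`,
];

const boundedRe = new RegExp(
  `${NOT_WORD_BEFORE}(?:${OPEN_STEMS.concat(BOUNDED_STEMS).join('|')})`,
  'iu',
);

// Hypothetical helper name, used only in this sketch.
const matchesBounded = (prompt: string): boolean => boundedRe.test(prompt);

console.log(matchesBounded('who calls this function'));   // true
console.log(matchesBounded('list the connections'));      // true
console.log(matchesBounded('the architecture overview')); // true
console.log(matchesBounded('he has a callus'));           // false
console.log(matchesBounded('Connecticut is a state'));    // false
console.log(matchesBounded('she is very affectionate'));  // false
console.log(matchesBounded('Tracey went home early'));    // false
console.log(matchesBounded('register a callback'));       // false
```

One thing this sketch surfaces: with a right boundary of this kind, compounds outside the enumerated suffixes (for example "callback") also stop matching. Whether that matters depends on whether such words are covered elsewhere in the gate, which I haven't checked.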

Verification / scope

Environment

Found on main (tip e699ee9, v1.2.0). Not yet in a tagged release (still origin/main as of
2026-07-02).


Happy to send a PR for the fix (I already have it written and tested locally), or just leave this
filed if you'd rather pick it up yourself. Let me know which you'd prefer.
