fix: prevent ReDoS in HTML tokenizer via bounded substring matching (issue #707) #708

Closed
vanhci wants to merge 1 commit into trentm:master from vanhci:fix/issue-707-html-tokenizer-redos

Conversation

@vanhci vanhci commented May 24, 2026

Summary

Fix a ReDoS (Regular Expression Denial of Service) vulnerability in the HTML tokenizer that allows a crafted ~60KB Markdown input to monopolize the Python render process for 2+ seconds.

Problem

_sorta_html_tokenize_re.split(text) applies the full tokenizer regex across the entire input. When the input contains repeated malformed tag fragments like <p m="1"<p m="1"<..., the regex engine attempts exponentially many attribute-split combinations during backtracking, causing O(2^n) CPU time.
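
To make the trigger shape concrete, here is a minimal sketch of how such an input could be built. The helper name `build_payload` and the 60,000-byte target are illustrative assumptions based on the description above, not code from this PR.

```python
# Hypothetical helper (not part of markdown2): builds the repeated
# malformed-fragment input described in the Problem section.
FRAGMENT = '<p m="1"'  # an opening tag with one attribute and no closing '>'


def build_payload(target_bytes=60_000):
    """Repeat the fragment until the payload is roughly target_bytes long."""
    repeats = target_bytes // len(FRAGMENT)
    return FRAGMENT * repeats


payload = build_payload()
# Every '<' starts something that looks like a tag, but no '>' ever closes it.
print(len(payload), payload.count("<"), payload.count(">"))
```

Because no fragment is ever closed, each `<` is a fresh starting point from which a tag regex can begin matching and then fail.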

Fix

Two changes, both replacing regex-heavy approaches with linear string operations:

  1. _sorta_html_tokenize() (new method): Replaces _sorta_html_tokenize_re.split(text) with a hand-written tokenizer that:

    • Locates < characters via str.find() (O(n) linear scan)
    • Bounds each candidate token between < and the next >
    • Uses the existing regex only in .match() mode on the bounded substring
    • This keeps the regex work for each candidate bounded by that candidate's length, instead of letting a single match attempt range over the full input
  2. _tag_is_closed(): Replaces re.findall('<%s(?:.*?)>' % tag_name, text) with a str.find() loop to avoid the same class of ReDoS.
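
The two changes above can be sketched roughly as follows. This is not the PR's code: `TAG_RE` is a simplified stand-in for `_sorta_html_tokenize_re`, the function names are illustrative, and the open/close count comparison in `tag_is_closed` is an assumption about how that check is used.

```python
import re

# Stand-in tag pattern (an assumption for illustration; the real
# _sorta_html_tokenize_re in markdown2 is larger than this).
TAG_RE = re.compile(r"""</?\w+(?:\s+[\w-]+=(?:"[^"]*"|'[^']*'))*\s*/?>""")


def sorta_html_tokenize(text):
    """Split text into plain-text runs and tag tokens, left to right."""
    tokens = []
    plain_start = 0   # start of the pending plain-text run
    search_from = 0   # where to look for the next '<'
    gt = -1           # index of the most recently located '>'
    while True:
        lt = text.find("<", search_from)
        if lt == -1:
            break
        if gt < lt:  # only rescan for '>' once we have moved past the old one
            gt = text.find(">", lt + 1)
            if gt == -1:
                break  # no '>' remains, so nothing further can be a tag
        # The regex only sees text[lt:gt + 1]; endpos avoids copying a slice.
        m = TAG_RE.match(text, lt, gt + 1)
        if m and m.end() == gt + 1:
            if lt > plain_start:
                tokens.append(text[plain_start:lt])
            tokens.append(m.group())
            plain_start = search_from = gt + 1
        else:
            search_from = lt + 1  # not a tag; this '<' stays plain text
    if plain_start < len(text):
        tokens.append(text[plain_start:])
    return tokens


def tag_is_closed(tag_name, text):
    """Rough balance check: count '<tag...>' openers and '</tag>' closers."""
    opener = "<" + tag_name
    opens = 0
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            break
        end = text.find(">", start + len(opener))
        if end == -1:
            break
        opens += 1
        pos = end + 1  # resume after this tag, so the scan never rewinds
    return opens == text.count("</%s>" % tag_name)


print(sorta_html_tokenize("a <b>c</b> d"))
print(tag_is_closed("p", '<p class="x">hi</p>'))
```

One limit of this sketch: because each candidate ends at the next `>`, a tag whose attribute value contains `>` is not recognized as a single token. Unlike `re.split` with a capturing group, it also emits no empty strings between adjacent tags.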

Benchmark

60KB trigger payload: 2.2s → 0.05s (roughly 40x speedup)
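
A measurement like this could be reproduced with a small timing helper along the following lines. The helper `time_call` is a hypothetical name, and the PR does not show the harness that produced the figures above.

```python
import time


def time_call(fn, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Example with a cheap stand-in workload; to time a real render one would
# pass markdown2.markdown and the trigger payload instead.
elapsed = time_call(str.upper, '<p m="1"' * 7500)
print(f"{elapsed:.6f}s")
```

Taking the best of several runs reduces noise from other processes; running the same call before and after the patch gives the two figures to compare.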

All existing tests pass. Normal Markdown rendering is unaffected.

Related

Fixes #707

Replace _sorta_html_tokenize_re.split(text) with a hand-written
tokenizer (_sorta_html_tokenize) that first locates <...> boundaries
using simple string operations, then validates each bounded substring
with the existing regex in .match() mode.

This eliminates catastrophic backtracking on malformed HTML fragments
like repeated <p m="1"< sequences, where the regex engine would
previously attempt exponentially many attribute-split combinations
across the full input.

Also replaces the re.findall call in _tag_is_closed with a linear
str.find loop to avoid the same class of ReDoS.

Benchmark: 60KB trigger payload goes from 2.2s to 0.05s (roughly 40x speedup).
All existing tests pass.

Fixes #707
Comment thread lib/markdown2.py
# here.
escaped = []
is_html_markup = False
for token in self._sorta_html_tokenize_re.split(text):
Contributor

I think the regex split is also used in _hash_html_spans. Worth replacing that as well if tests pass?

@vanhci vanhci closed this by deleting the head repository May 24, 2026

Development

Successfully merging this pull request may close these issues.

markdown2 malformed HTML tokenizer CPU denial of service