fix: prevent ReDoS in HTML tokenizer via bounded substring matching (issue #707) by vanhci · Pull Request #708 · trentm/python-markdown2

vanhci · 2026-05-24T01:06:39Z

Summary

Fix a ReDoS (Regular Expression Denial of Service) vulnerability in the HTML tokenizer that allows a crafted ~60KB Markdown input to monopolize the Python render process for 2+ seconds.

Problem

_sorta_html_tokenize_re.split(text) applies the full tokenizer regex across the entire input. When the input contains repeated malformed tag fragments like <p m="1"<p m="1"<..., the regex engine attempts exponentially many attribute-split combinations during backtracking, causing O(2^n) CPU time.
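For reference, a payload of the shape described above can be built in a couple of lines. This is a minimal sketch: `FRAGMENT` and `build_payload` are illustrative names rather than anything from the repository, and the 60,000-byte target is taken from the ~60KB figure in the summary.

```python
# Hypothetical helper: builds the repeated malformed-tag input described above.
FRAGMENT = '<p m="1"'  # an opening tag with one attribute and no closing '>'


def build_payload(target_bytes=60_000):
    """Repeat the unterminated tag fragment until roughly target_bytes long."""
    repeats = target_bytes // len(FRAGMENT)
    return FRAGMENT * repeats


payload = build_payload()
print(len(payload), payload[:24])
```

The payload never contains a `>`, so every `<p m="1"` looks like the start of a tag that the tokenizer regex can begin to match but never finish.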

Fix

Two changes, both replacing regex-heavy approaches with linear string operations:

_sorta_html_tokenize() (new method): Replaces _sorta_html_tokenize_re.split(text) with a hand-written tokenizer that:
- Locates < characters via str.find() (O(n) linear scan)
- Bounds each candidate token between < and the next >
- Uses the existing regex only in .match() mode on the bounded substring

Each regex call is therefore bounded by the length of one candidate token, instead of a single call backtracking across the full input.

_tag_is_closed(): Replaces re.findall('<%s(?:.*?)>' % tag_name, text) with a str.find() loop to avoid the same class of ReDoS.
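The bounded-tokenizer idea can be sketched as follows. This is not the code from the PR: `TAG_RE` is a simplified stand-in for the library's tokenizer pattern, `sorta_html_tokenize` is a free function rather than a method, and the sketch uses `fullmatch` with `pos`/`endpos` in place of slicing followed by `.match()`. It returns the same alternating text/markup list shape that `re.split` with a capturing group produces.

```python
import re

# Assumption: a simplified tag pattern, NOT markdown2's real tokenizer regex.
TAG_RE = re.compile(r"""</?\w+(?:\s+[\w-]+=(?:"[^"]*"|'[^']*'))*\s*/?>""")


def sorta_html_tokenize(text):
    """Split text into [plain, markup, plain, markup, ..., plain]."""
    tokens = []
    pending = 0  # start of the plain-text run not yet emitted
    search = 0   # where to look for the next '<'
    gt = -1      # cached index of the next '>' (reused while still ahead)
    while True:
        lt = text.find("<", search)
        if lt == -1:
            break
        if gt < lt:
            gt = text.find(">", lt + 1)
            if gt == -1:
                break  # no '>' left, so nothing later can be a tag
        # The regex only ever sees the bounded span text[lt:gt + 1].
        if TAG_RE.fullmatch(text, lt, gt + 1):
            tokens.append(text[pending:lt])
            tokens.append(text[lt:gt + 1])
            pending = search = gt + 1
        else:
            search = lt + 1  # not a tag: treat '<' as plain text
    tokens.append(text[pending:])
    return tokens


print(sorta_html_tokenize("a <b>c</b> d"))
```

One limitation of bounding at the next `>`: a tag whose quoted attribute value contains `>` is cut short and falls through as plain text in this sketch, so a real implementation would need to decide how to handle that case.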

Benchmark

60KB trigger payload: 2.2s → 0.05s (roughly 44x speedup)
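A measurement like this can be reproduced with a small timing harness. The sketch below makes several assumptions: `best_of` and `speedup` are illustrative helpers, and `str.upper` is only a stand-in for the renderer so the block runs without the library installed.

```python
import time


def best_of(render, text, repeats=3):
    """Return the fastest wall-clock time, in seconds, of render(text)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        render(text)
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(before_s, after_s):
    """Ratio of the old time to the new time."""
    return before_s / after_s


payload = '<p m="1"' * 7500             # ~60KB of unterminated tag fragments
elapsed = best_of(str.upper, payload)   # stand-in callable, not the real renderer
print(f"{elapsed:.6f}s")
print(f"{speedup(2.2, 0.05):.0f}x")     # ratio of the two timings quoted above
```

To time the actual renderer, pass `markdown2.markdown` as `render` on a checkout before and after the change.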

All existing tests pass. Normal Markdown rendering is unaffected.

Replace _sorta_html_tokenize_re.split(text) with a hand-written tokenizer (_sorta_html_tokenize) that first locates <...> boundaries using simple string operations, then validates each bounded substring with the existing regex in .match() mode. This eliminates catastrophic backtracking on malformed HTML fragments like repeated <p m="1"< sequences, where the regex engine would previously attempt exponentially many attribute-split combinations across the full input.

Also replace the re.findall call in _tag_is_closed with a linear str.find loop to avoid the same class of ReDoS.

Benchmark: 60KB trigger payload goes from 2.2s to 0.05s (roughly 44x speedup). All existing tests pass.

Fixes #707
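The str.find loop for counting opening tags might look like the sketch below. It is an assumption about the approach, not the PR's code: `count_open_tags` is a hypothetical name, and `count_open_tags_regex` wraps the pattern quoted in the description so the two can be compared.

```python
import re


def count_open_tags_regex(tag_name, text):
    """Reference: the regex-based count the description says was replaced."""
    return len(re.findall('<%s(?:.*?)>' % tag_name, text))


def count_open_tags(tag_name, text):
    """Count '<tag_name ... >' spans with str.find only (hypothetical helper)."""
    prefix = "<" + tag_name
    count = 0
    pos = 0
    end = -1  # cached index of the next '>'
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        body = start + len(prefix)
        if end < body:
            end = text.find(">", body)
            if end == -1:
                break  # no '>' remains, so no later candidate can close
        newline = text.find("\n", body, end)
        if newline != -1:
            # '.' does not match a newline, so the regex would fail here too;
            # no candidate before the newline can succeed, so skip past it.
            pos = newline + 1
            continue
        count += 1
        pos = end + 1
    return count


sample = '<p>a</p><p class="x">b</p>'
print(count_open_tags("p", sample), count_open_tags_regex("p", sample))
```

Like the regex it mirrors, this matches on the prefix only, so counting `p` also counts tags such as `<pre>`.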

Crozzers · 2026-05-24T09:12:08Z

        # here.
        escaped = []
-        is_html_markup = False
-        for token in self._sorta_html_tokenize_re.split(text):


I think the regex split is also used in _hash_html_spans. Worth replacing that as well if tests pass?

Crozzers reviewed May 24, 2026


vanhci closed this by deleting the head repository May 24, 2026

vanhci wants to merge 1 commit into trentm:master from vanhci:fix/issue-707-html-tokenizer-redos
