
Speed up parser by caching normalized identifiers? #148931

@StanFromIreland

Description


Feature or enhancement

Proposal:

import ast
body = '섀' + 'S' * 8000000  # non-ASCII!
ast.literal_eval('Y(' + body + ')')  # pretty fast
ast.literal_eval('Y(' + body + '=!')  # noticeably slower due to re-normalization
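The cost of the normalization step itself can be illustrated directly (a rough sketch; `unicodedata.normalize` is what `_PyPegen_new_identifier` ends up calling for non-ASCII names, and exact timings vary by machine):

```python
import time
import unicodedata

# A long identifier whose first character is non-ASCII, so the parser's
# ASCII fast path is skipped and NFKC normalization kicks in.
ident = '섀' + 'S' * 100_000

start = time.perf_counter()
normalized = unicodedata.normalize('NFKC', ident)
once = time.perf_counter() - start

# Simulate the parser re-normalizing the same identifier repeatedly.
start = time.perf_counter()
for _ in range(20):
    unicodedata.normalize('NFKC', ident)
repeated = time.perf_counter() - start

# Each call is linear in the length of the identifier, so re-running it
# on every parser retry multiplies the total cost.
```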

After some profiling, it is pretty clear that we are normalising the identifier many times:

(profiling screenshot omitted)

Instead of normalising every time in _PyPegen_new_identifier:

cpython/Parser/pegen.c

Lines 502 to 533 in 448d7b9

if (!PyUnicode_IS_ASCII(id)) {
    if (!init_normalization(p)) {
        Py_DECREF(id);
        goto error;
    }
    PyObject *form = PyUnicode_InternFromString("NFKC");
    if (form == NULL) {
        Py_DECREF(id);
        goto error;
    }
    PyObject *args[2] = {form, id};
    PyObject *id2 = PyObject_Vectorcall(p->normalize, args, 2, NULL);
    Py_DECREF(id);
    Py_DECREF(form);
    if (!id2) {
        goto error;
    }
    if (!PyUnicode_Check(id2)) {
        PyErr_Format(PyExc_TypeError,
                     "unicodedata.normalize() must return a string, not "
                     "%.200s",
                     _PyType_Name(Py_TYPE(id2)));
        Py_DECREF(id2);
        goto error;
    }
    id = id2;
}

Maybe we could cache the normalized identifier in the Token? I could write a patch along those lines :-)
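At the Python level, the idea can be sketched as a simple memo table keyed on the raw identifier (hypothetical helper; the real patch would store the result on the C `Token` instead):

```python
import unicodedata

# Hypothetical cache: raw identifier -> NFKC-normalized identifier.
_normalized_cache: dict[str, str] = {}

def normalize_identifier(raw: str) -> str:
    """NFKC-normalize an identifier, computing the result at most once.

    Mirrors the fast path in _PyPegen_new_identifier: pure-ASCII names
    are already in NFKC form, so they bypass normalization entirely.
    """
    if raw.isascii():
        return raw
    cached = _normalized_cache.get(raw)
    if cached is None:
        cached = unicodedata.normalize('NFKC', raw)
        _normalized_cache[raw] = cached
    return cached
```

With this shape, the second and later encounters of the same non-ASCII name become dictionary hits instead of full normalization passes over the whole identifier.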

CC @pablogsal @lysnikolaou

This was found by OSS-Fuzz.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-parser, type-feature (A feature request or enhancement)
