Skip to content

Aiohttp unification#397

Open
BenjaminSchaaf wants to merge 3 commits into
packagecontrol:mainfrom
sublimehq:aiohttp-unification
Open

Aiohttp unification#397
BenjaminSchaaf wants to merge 3 commits into
packagecontrol:mainfrom
sublimehq:aiohttp-unification

Conversation

@BenjaminSchaaf

Copy link
Copy Markdown

Unify aiohttp client creation

Headers like user-agent and settings like rate limiting and
raise_for_status are currently applied inconsistently across scripts.
This removes that duplication and simplifies http client usage.

@kaste

kaste commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

This one is buggy.

  • You remove raise_for_status=True in compress_channel.py but never use the new helper.

  • You also remove resp.raise_for_status() in all concrete provider implementations but have the
    main caller crawl.py untouched.

    Unsure, if leaving crawl.py as-is is intentional or not.

The runtime change is that you increase potential concurrency in generate_registry.py, basically
in the only file where I restricted it. And if we were to transform crawl.py too, we would
decrease its concurrency, which is the main horse here.

return await resp.json(content_type=None) can return None where json.loads would have thrown. Undecided if this is significant.

MAX_CONCURRENCY is unused after your change.

Headers like user-agent and settings like rate limiting and
raise_for_status are currently applied inconsistently across scripts.
This removes that duplication and simplifies http client usage.
aiohttp by default keep track of cookies for a session. There is no need
for this as we're only making API calls.
@BenjaminSchaaf

Copy link
Copy Markdown
Author

Thanks for the review, I've fixed the missing calls in compress_channel.py and crawl.py and removed the unused MAX_CONCURRENCY.

The runtime change is that you increase potential concurrency in generate_registry.py, basically
in the only file where I restricted it.

Only across different domains though, otherwise the same limit is applied in create_aiohttp_session. If the total connections are a problem I could make this configurable? What was the reason this limit was added, I couldn't find any explanation in the git history?

return await resp.json(content_type=None) can return None where json.loads would have thrown. Undecided if this is significant.

content_type=None parameter doesn't change the behavior beyond removing the content-type check, see https://github.com/aio-libs/aiohttp/blob/12ea5a580ac02f5062b927ac3fa278000d7abb04/aiohttp/client_reqrep.py#L672. json.loads can always return None if the json is null, and ClientResponse.json calls json.loads here anyway, so I don't see any case where json.loads would have thrown where this code doesn't.

@kaste

kaste commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

ClientResponse.json (just) calls json.loads

thanks for looking that up.

What was the reason this limit was added

Generally, the sketch of this file is copy/paste from some other project. The limit is here because of GIL churn. At least very, very likely. It is/was faster with it. But because how the data is spread in this case, basically everything comes from "raw.githubusercontent.com", there is no difference. In the original source I obviously had different data, and likely more heavy computational cost. This is all pretty light here.

You do apply the limit to crawl.py with this change. This makes it slower and I or GitHub's CI never needed it. Did you ran into this to be necessary? (On CI we run with -n 1500, and 32 vs 100 is significant.)

Otherwise, looks good.

@kaste

kaste commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

A follow up can be a simple env variable e.g. MAX_CONNECTIONS_PER_HOST which we use if set. This way we wouldn't need to add this as an arg to every script/command.

MAX_CONNECTIONS_PER_HOST environment variable can now be set to override
the per-host connection limit, and scripts can set their own optimized
default settings.
@BenjaminSchaaf

Copy link
Copy Markdown
Author

I've added a MAX_CONNECTIONS_PER_HOST environment variable and set the default for only generate_registry.py to 32 connections per host.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants