Aiohttp unification by BenjaminSchaaf · Pull Request #397 · packagecontrol/thecrawl

BenjaminSchaaf · 2026-06-11T03:50:37Z

Unify aiohttp client creation

Headers like user-agent and settings like rate limiting and
raise_for_status are currently applied inconsistently across scripts.
This removes that duplication and simplifies http client usage.

kaste · 2026-06-12T08:08:55Z

This one is buggy.

You remove raise_for_status=True in compress_channel.py but never use the new helper.
You also remove resp.raise_for_status() in all concrete provider implementations but have the
main caller crawl.py untouched.

Unsure, if leaving crawl.py as-is is intentional or not.

The runtime change is that you increase potential concurrency in generate_registry.py, basically
in the only file where I restricted it. And if we were to transform crawl.py too, we would
decrease its concurrency, which is the main horse here.

return await resp.json(content_type=None) can return None where json.loads would have thrown. Undecided if this is significant.

MAX_CONCURRENCY is unused after your change.

Headers like user-agent and settings like rate limiting and raise_for_status are currently applied inconsistently across scripts. This removes that duplication and simplifies http client usage.

aiohttp by default keep track of cookies for a session. There is no need for this as we're only making API calls.

BenjaminSchaaf · 2026-06-12T09:52:20Z

Thanks for the review, I've fixed the missing calls in compress_channel.py and crawl.py and removed the unused MAX_CONCURRENCY.

The runtime change is that you increase potential concurrency in generate_registry.py, basically
in the only file where I restricted it.

Only across different domains though, otherwise the same limit is applied in create_aiohttp_session. If the total connections are a problem I could make this configurable? What was the reason this limit was added, I couldn't find any explanation in the git history?

return await resp.json(content_type=None) can return None where json.loads would have thrown. Undecided if this is significant.

content_type=None parameter doesn't change the behavior beyond removing the content-type check, see https://github.com/aio-libs/aiohttp/blob/12ea5a580ac02f5062b927ac3fa278000d7abb04/aiohttp/client_reqrep.py#L672. json.loads can always return None if the json is null, and ClientResponse.json calls json.loads here anyway, so I don't see any case where json.loads would have thrown where this code doesn't.

kaste · 2026-06-12T11:38:14Z

ClientResponse.json (just) calls json.loads

thanks for looking that up.

What was the reason this limit was added

Generally, the sketch of this file is copy/paste from some other project. The limit is here because of GIL churn. At least very, very likely. It is/was faster with it. But because how the data is spread in this case, basically everything comes from "raw.githubusercontent.com", there is no difference. In the original source I obviously had different data, and likely more heavy computational cost. This is all pretty light here.

You do apply the limit to crawl.py with this change. This makes it slower and I or GitHub's CI never needed it. Did you ran into this to be necessary? (On CI we run with -n 1500, and 32 vs 100 is significant.)

Otherwise, looks good.

kaste · 2026-06-12T11:51:59Z

A follow up can be a simple env variable e.g. MAX_CONNECTIONS_PER_HOST which we use if set. This way we wouldn't need to add this as an arg to every script/command.

MAX_CONNECTIONS_PER_HOST environment variable can now be set to override the per-host connection limit, and scripts can set their own optimized default settings.

BenjaminSchaaf · 2026-06-15T04:16:16Z

I've added a MAX_CONNECTIONS_PER_HOST environment variable and set the default for only generate_registry.py to 32 connections per host.

BenjaminSchaaf added 2 commits June 12, 2026 19:44

Unify aiohttp client creation

9afee74

Headers like user-agent and settings like rate limiting and raise_for_status are currently applied inconsistently across scripts. This removes that duplication and simplifies http client usage.

Use dummy cookie jar for aiohttp

b362c23

aiohttp by default keep track of cookies for a session. There is no need for this as we're only making API calls.

BenjaminSchaaf force-pushed the aiohttp-unification branch from c972458 to b362c23 Compare June 12, 2026 09:52

Make connection limits configurable

5bc0b09

MAX_CONNECTIONS_PER_HOST environment variable can now be set to override the per-host connection limit, and scripts can set their own optimized default settings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aiohttp unification#397

Aiohttp unification#397
BenjaminSchaaf wants to merge 3 commits into
packagecontrol:mainfrom
sublimehq:aiohttp-unification

BenjaminSchaaf commented Jun 11, 2026

Uh oh!

kaste commented Jun 12, 2026

Uh oh!

BenjaminSchaaf commented Jun 12, 2026

Uh oh!

kaste commented Jun 12, 2026

Uh oh!

kaste commented Jun 12, 2026

Uh oh!

BenjaminSchaaf commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

BenjaminSchaaf commented Jun 11, 2026

Uh oh!

kaste commented Jun 12, 2026

Uh oh!

BenjaminSchaaf commented Jun 12, 2026

Uh oh!

kaste commented Jun 12, 2026

Uh oh!

kaste commented Jun 12, 2026

Uh oh!

BenjaminSchaaf commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants