feat: env-var configurable httpx pool + TLS in split_pdf_hook (0.45.0) #344
Merged
Make httpx.AsyncClient pool config in split_pdf_hook.run_tasks configurable via env vars:
- UNSTRUCTURED_CLIENT_MAX_CONNECTIONS (default 100)
- UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS (default 20)
- UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY (default 5.0 seconds)

Defaults match httpx's built-in defaults, so this is fully backward compatible.

Also extends the existing split_pdf event=plan_created INFO log to include the resolved pool values, making the active config visible in production logs.

When the SDK is used in an environment where load balancing happens at TCP-connect time rather than per-request (a common Kubernetes setup with simple Services), httpx's default keepalive pooling can lock onto a subset of backends. Newly added backends never receive traffic because existing connections stay glued to the originally-resolved set.

Allowing operators to force shorter keepalive (e.g. MAX_KEEPALIVE_CONNECTIONS=1 + a low KEEPALIVE_EXPIRY) makes the client re-establish connections more frequently, redistributing across the available backends.

Defaults are unchanged; this purely adds knobs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
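The env-var resolution this commit describes could look roughly like the sketch below. It is not the code from the PR: `PoolConfig`, `_read`, and `resolve_pool_config` are hypothetical names, and falling back to the default on a malformed value is an assumption rather than documented behaviour. Only the three env-var names and their defaults come from the commit message.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PoolConfig:
    # Defaults mirror the values stated in the commit message.
    max_connections: int = 100
    max_keepalive_connections: int = 20
    keepalive_expiry: float = 5.0


def _read(env: Mapping[str, str], name: str, cast: Callable[[str], T], default: T) -> T:
    """Return cast(env[name]), or the default when unset, blank, or malformed."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        return cast(raw)
    except ValueError:
        # Assumption: a malformed value falls back to the default instead of raising.
        return default


def resolve_pool_config(env: Optional[Mapping[str, str]] = None) -> PoolConfig:
    """Resolve the three pool settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    return PoolConfig(
        max_connections=_read(env, "UNSTRUCTURED_CLIENT_MAX_CONNECTIONS", int, 100),
        max_keepalive_connections=_read(
            env, "UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS", int, 20
        ),
        keepalive_expiry=_read(env, "UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY", float, 5.0),
    )
```

The three resolved fields line up with the `max_connections`, `max_keepalive_connections`, and `keepalive_expiry` keyword arguments of `httpx.Limits`, which is presumably where the hook passes them. Taking `env` as a parameter keeps the helper testable without touching the real process environment.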
7bb29b9 to 60f15e2
ctrahey
reviewed
Jun 4, 2026
ctrahey
left a comment
Please update every httpx client to accept configurable TLS certificates as we touch them!
Addresses review feedback on the pool-config PR: extend the same
env-var-driven approach to TLS verification and client certificates
so operators can plug in a custom CA bundle or mTLS cert without
modifying SDK code.
New env vars (all unset by default → httpx defaults):
- UNSTRUCTURED_CLIENT_TLS_CA_BUNDLE: path to a CA bundle file that
overrides the system trust store. Typical use: internal CA for a
corporate proxy or private hosting endpoint.
- UNSTRUCTURED_CLIENT_TLS_VERIFY: set to a falsy value
("false"/"0"/"no"/"off") to disable server cert verification
entirely. Dev-only path.
- UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT: path to a client cert PEM
file for mTLS.
- UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY: optional separate key file
path. If unset, httpx reads the key from the cert PEM.
If CA bundle is set, it wins over the verify flag — explicit trust
store beats "disable verify".
Resolved config flows into both the existing batch_async_start DEBUG
log and the plan_created INFO log as `tls=<verify-mode> <cert-mode>`,
using human-readable descriptors that don't leak filesystem paths.
Tests cover: defaults, CA bundle path, verify-false (parameterized
over case/synonyms), verify-true (parameterized over truthy values),
CA-bundle-wins-over-verify, client cert alone, cert+key split, and a
mocked-AsyncClient end-to-end check that the resolved config reaches
the httpx.AsyncClient kwargs. 108/108 split_pdf_hook tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
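The precedence rule in this commit (CA bundle beats the verify flag) can be sketched as follows. This is an illustration of the behaviour described in the message, not the PR's code: `resolve_verify` and `describe_verify` are hypothetical names, and the descriptor strings are invented for the example, since the message only says the logged descriptors are human-readable and path-free.

```python
from typing import Mapping, Union

# Falsy spellings listed in the commit message; matching is case-insensitive.
FALSY = frozenset({"false", "0", "no", "off"})


def resolve_verify(env: Mapping[str, str]) -> Union[bool, str]:
    """Return a CA-bundle path, False (verification disabled), or True (default)."""
    ca_bundle = env.get("UNSTRUCTURED_CLIENT_TLS_CA_BUNDLE", "").strip()
    if ca_bundle:
        # An explicit trust store wins over "disable verify".
        return ca_bundle
    flag = env.get("UNSTRUCTURED_CLIENT_TLS_VERIFY", "").strip().lower()
    # Unset, blank, or any non-falsy value keeps verification on.
    return flag not in FALSY


def describe_verify(verify: Union[bool, str]) -> str:
    """Log-safe label for the verify mode; never includes the bundle path.

    The label strings are hypothetical, chosen only for this sketch.
    """
    if verify is True:
        return "system-trust"
    if verify is False:
        return "verify-disabled"
    return "custom-ca-bundle"
```

For example, with both variables set, `resolve_verify` returns the bundle path and `describe_verify` reports only that a custom bundle is in use.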
awalker4
approved these changes
Jun 5, 2026
Collaborator
awalker4
left a comment
LGTM! The repo is no longer auto-generated, so we're empowered to own the changelog and bump the version before doing a github release.
Address Trahey's review feedback:
1. Drop the word "Client" from the trust-store env vars. In TLS, "client" specifically means mTLS client authentication, so a name like UNSTRUCTURED_CLIENT_TLS_VERIFY is ambiguous when it's really about server-cert verification.
2. Honor the standard env vars other libraries already respect (SSL_CERT_FILE first, then REQUESTS_CA_BUNDLE). A single env-var setting now applies uniformly across Python tooling.

The mTLS client-auth env vars (UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT / _CLIENT_KEY) keep their names; the word "CLIENT" there refers to TLS client authentication, which is its correct usage.

The disable-verify knob is removed entirely; the standard env vars have no such escape hatch by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
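After this change, trust-store and mTLS resolution might be sketched as below. The function names are hypothetical and this is not the PR's implementation. Two assumptions are labelled in the comments: blank values are treated as unset, and a key supplied without a cert is ignored. The lookup order and the env-var names come from the commit message.

```python
from typing import Mapping, Optional, Tuple, Union

# Either no client cert, a single PEM path, or a (cert, key) pair of paths.
CertValue = Union[None, str, Tuple[str, str]]


def resolve_trust_store(env: Mapping[str, str]) -> Optional[str]:
    """Return the CA-bundle path: SSL_CERT_FILE first, then REQUESTS_CA_BUNDLE."""
    for name in ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE"):
        # Assumption: a blank value counts as unset.
        path = env.get(name, "").strip()
        if path:
            return path
    # None means "use the default trust store"; there is no disable-verify option.
    return None


def resolve_client_cert(env: Mapping[str, str]) -> CertValue:
    """Return the mTLS client cert as a path, a (cert, key) pair, or None."""
    cert = env.get("UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT", "").strip()
    key = env.get("UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY", "").strip()
    if not cert:
        # Assumption: a key without a cert is ignored rather than rejected.
        return None
    return (cert, key) if key else cert
```

The return shapes of `resolve_client_cert` follow the forms httpx has accepted for its `cert` argument (a single PEM path, or a cert/key pair); how the hook actually hands these values to `httpx.AsyncClient` is not shown in this thread.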
Per Austin's review note: the repo no longer auto-generates, so this PR owns its version bump and changelog entry directly.
- src/unstructured_client/_version.py: 0.44.1 -> 0.45.0
- CHANGELOG.md: 0.45.0 Features section covering pool limits, trust store (SSL_CERT_FILE / REQUESTS_CA_BUNDLE), mTLS client cert, and the extended plan_created observability log
- RELEASES.md: append matching Speakeasy-style v0.45.0 entry

Minor bump because all additions are new optional env vars with defaults that match httpx (fully backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Adds env-var knobs for the httpx.AsyncClient used by split_pdf_hook.run_tasks, and ships them as 0.45.0. Defaults match httpx, so the change is fully backward compatible.

Connection-pool limits
- UNSTRUCTURED_CLIENT_MAX_CONNECTIONS (default 100)
- UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS (default 20)
- UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY (default 5.0 seconds)

TLS trust store (server verification)
Honors the standard env vars other Python tooling already respects, so a single setting applies uniformly:
- SSL_CERT_FILE (stdlib ssl convention)
- REQUESTS_CA_BUNDLE (requests / httpx-ecosystem convention; used if SSL_CERT_FILE is unset)

mTLS client certificate
- UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT: PEM file (httpx reads the key from the same file by default)
- UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY: optional, for when the cert and key live in separate files

Observability
- Extends the existing split_pdf event=plan_created INFO log to include the resolved pool values and trust-store / mTLS mode, so the active config is visible in production logs without leaking filesystem paths.

Release
- Bumps _version.py to 0.45.0, adds a 0.45.0 CHANGELOG.md section, and appends a matching RELEASES.md entry.

Why
When the SDK runs in an environment where load balancing happens at TCP-connect time rather than per-request (a common Kubernetes setup with a plain ClusterIP and no service mesh), httpx's default keepalive pooling can lock onto a subset of backends. Newly added backends never receive traffic because existing connections stay glued to the originally-resolved set.

Letting operators force shorter keepalive (e.g. MAX_KEEPALIVE_CONNECTIONS=1 + a low KEEPALIVE_EXPIRY) makes the client re-establish connections more frequently, redistributing across the available backends.

The TLS additions are for SDK consumers running behind corporate proxies with custom CAs, or against backends that require mTLS. Previously they had to subclass or monkey-patch to get a custom verify or cert into the split-PDF client.

How to use