feat: env-var configurable httpx pool + TLS in split_pdf_hook (0.45.0) #344

Merged
fxdgear merged 4 commits into main from feat/httpx-pool-env-config
Jun 5, 2026

@fxdgear
Contributor

@fxdgear fxdgear commented Jun 4, 2026

What

Adds env-var knobs for the httpx.AsyncClient used by split_pdf_hook.run_tasks, and ships them as 0.45.0. Defaults match httpx — fully backward compatible.

Connection-pool limits

  • UNSTRUCTURED_CLIENT_MAX_CONNECTIONS (default 100)
  • UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS (default 20)
  • UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY (default 5.0 seconds)
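
As a rough illustration of how these three knobs could be read, here is a minimal sketch. It is not the SDK's actual code: the helper name `resolve_pool_limits` is hypothetical, and the choice to let a malformed value raise `ValueError` is an assumption. Only the env var names and defaults come from this PR.

```python
import os
from typing import Dict, Mapping, Optional, Union


def resolve_pool_limits(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[int, float]]:
    """Hypothetical helper: read the pool env vars, falling back to the
    defaults listed in this PR (which match httpx's own defaults)."""
    env = os.environ if env is None else env
    return {
        # A malformed value raises ValueError here; that is an assumption
        # of this sketch, not documented SDK behavior.
        "max_connections": int(
            env.get("UNSTRUCTURED_CLIENT_MAX_CONNECTIONS", "100")
        ),
        "max_keepalive_connections": int(
            env.get("UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS", "20")
        ),
        "keepalive_expiry": float(
            env.get("UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY", "5.0")
        ),
    }
```

The returned keys mirror the keyword arguments of `httpx.Limits`, so the result could be handed to a client as `httpx.AsyncClient(limits=httpx.Limits(**resolve_pool_limits()))`.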

TLS trust store (server verification)

Honors the standard env vars other Python tooling already respects, so a single setting applies uniformly:

  • SSL_CERT_FILE (stdlib ssl convention)
  • REQUESTS_CA_BUNDLE (requests / httpx-ecosystem convention; used if SSL_CERT_FILE is unset)
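
The lookup order described above could be sketched roughly as follows. The helper name `resolve_verify` is hypothetical, and treating an empty string as "unset" is an assumption of this sketch rather than something the PR states.

```python
import os
from typing import Mapping, Optional, Union


def resolve_verify(env: Optional[Mapping[str, str]] = None) -> Union[bool, str]:
    """Hypothetical helper: pick the CA bundle path for server verification.

    SSL_CERT_FILE is consulted first, then REQUESTS_CA_BUNDLE. If neither
    is set, return True so the default trust store is used.
    """
    env = os.environ if env is None else env
    for name in ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE"):
        path = env.get(name)
        if path:  # assumption: an empty value counts as unset
            return path
    return True
```

The result is shaped like a value for the `verify` argument of `httpx.AsyncClient`. Depending on the httpx version, a path may be better wrapped as `ssl.create_default_context(cafile=path)` before being passed in.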

mTLS client certificate

  • UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT — PEM file (httpx reads key from the same file by default)
  • UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY — optional, when cert and key live in separate files
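
A minimal sketch of how the two mTLS variables might combine, assuming a hypothetical helper named `resolve_client_cert`. Ignoring a key that is set without a cert is an assumption here; the PR does not say how that case is handled.

```python
import os
from typing import Mapping, Optional, Tuple, Union


def resolve_client_cert(
    env: Optional[Mapping[str, str]] = None,
) -> Union[None, str, Tuple[str, str]]:
    """Hypothetical helper: build a client-cert value from the env vars.

    Returns None (no mTLS), a single PEM path (cert and key in one file),
    or a (cert, key) tuple when the key lives in a separate file.
    """
    env = os.environ if env is None else env
    cert = env.get("UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT")
    key = env.get("UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY")
    if not cert:
        # Assumption: a key without a cert is ignored rather than rejected.
        return None
    return (cert, key) if key else cert
```

Both return shapes, a single path or a `(cert, key)` tuple, are forms that httpx accepts for its `cert` argument.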

Observability

  • Extends the existing split_pdf event=plan_created INFO log to include the resolved pool values and trust-store / mTLS mode, so the active config is visible in production logs without leaking filesystem paths.
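
To illustrate what "without leaking filesystem paths" could look like, here is a sketch that maps resolved values to fixed descriptors. The function name `describe_tls` and the descriptor strings are hypothetical; the real log wording may differ.

```python
from typing import Tuple, Union


def describe_tls(
    verify: Union[bool, str],
    cert: Union[None, str, Tuple[str, str]],
) -> str:
    """Hypothetical helper: summarize TLS config for logging.

    Only fixed labels are emitted, never the paths themselves.
    """
    verify_mode = "custom-ca" if isinstance(verify, str) else "system-default"
    if cert is None:
        cert_mode = "no-client-cert"
    elif isinstance(cert, tuple):
        cert_mode = "client-cert+key"
    else:
        cert_mode = "client-cert"
    return f"tls={verify_mode} {cert_mode}"
```

With a custom CA bundle and a split cert/key pair this yields `tls=custom-ca client-cert+key`, which tells an operator which mode is active without revealing where the files live.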

Release

  • Bumps _version.py to 0.45.0, adds a 0.45.0 CHANGELOG.md section, and appends a matching RELEASES.md entry.

Why

When the SDK runs in an environment where load balancing happens at TCP-connect time rather than per-request (a common Kubernetes setup with a plain ClusterIP and no service mesh), httpx's default keepalive pooling can lock onto a subset of backends. Newly added backends never receive traffic because existing connections stay glued to the originally-resolved set.

Letting operators force shorter keepalive (e.g. MAX_KEEPALIVE_CONNECTIONS=1 + a low KEEPALIVE_EXPIRY) makes the client re-establish connections more frequently, redistributing across the available backends.

The TLS additions are for SDK consumers running behind corporate proxies with custom CAs, or against backends that require mTLS — previously they had to subclass / monkey-patch to get a custom verify or cert into the split-PDF client.

How to use

env:
  # Pool reshuffling for connect-time-only LBs
  - name: UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS
    value: "1"
  - name: UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY
    value: "30.0"

  # Custom trust store (standard env var, picked up by the stdlib ssl module and httpx)
  - name: SSL_CERT_FILE
    value: /etc/ssl/internal-ca-bundle.pem

  # mTLS
  - name: UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT
    value: /etc/ssl/client.crt
  - name: UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY
    value: /etc/ssl/client.key

Make httpx.AsyncClient pool config in split_pdf_hook.run_tasks configurable via env vars:

- UNSTRUCTURED_CLIENT_MAX_CONNECTIONS (default 100)
- UNSTRUCTURED_CLIENT_MAX_KEEPALIVE_CONNECTIONS (default 20)
- UNSTRUCTURED_CLIENT_KEEPALIVE_EXPIRY (default 5.0 seconds)

Defaults match httpx's built-in defaults, so this is fully backward
compatible.

Also extends the existing split_pdf event=plan_created INFO log to
include the resolved pool values, making the active config visible in
production logs.

When the SDK is used in an environment where load balancing happens at
TCP-connect time rather than per-request (a common Kubernetes setup
with simple Services), httpx's default keepalive pooling can lock onto
a subset of backends. Newly added backends never receive traffic
because existing connections stay glued to the originally-resolved
set. Allowing operators to force shorter keepalive (e.g.
MAX_KEEPALIVE_CONNECTIONS=1 + a low KEEPALIVE_EXPIRY) makes the client
re-establish connections more frequently, redistributing across the
available backends.

Defaults are unchanged — this purely adds knobs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@fxdgear fxdgear force-pushed the feat/httpx-pool-env-config branch from 7bb29b9 to 60f15e2 Compare June 4, 2026 17:40
@fxdgear fxdgear marked this pull request as ready for review June 4, 2026 17:58

@ctrahey ctrahey left a comment

Please update every httpx client to accept configurable TLS certificates as we touch them!

Addresses review feedback on the pool-config PR: extend the same
env-var-driven approach to TLS verification and client certificates
so operators can plug in a custom CA bundle or mTLS cert without
modifying SDK code.

New env vars (all unset by default → httpx defaults):

- UNSTRUCTURED_CLIENT_TLS_CA_BUNDLE: path to a CA bundle file that
  overrides the system trust store. Typical use: internal CA for a
  corporate proxy or private hosting endpoint.
- UNSTRUCTURED_CLIENT_TLS_VERIFY: set to a falsy value
  ("false"/"0"/"no"/"off") to disable server cert verification
  entirely. Dev-only path.
- UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT: path to a client cert PEM
  file for mTLS.
- UNSTRUCTURED_CLIENT_TLS_CLIENT_KEY: optional separate key file
  path. If unset, httpx reads the key from the cert PEM.

If CA bundle is set, it wins over the verify flag — explicit trust
store beats "disable verify".
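
The precedence rule in this commit can be sketched as below. This reflects the interim design only: a later commit in this PR renames the trust-store variable and removes the verify flag. The helper name `resolve_verify_interim` is hypothetical, and case-insensitive matching is inferred from the test description rather than confirmed.

```python
from typing import Mapping, Union

# Falsy spellings listed in this commit message.
FALSY_VALUES = {"false", "0", "no", "off"}


def resolve_verify_interim(env: Mapping[str, str]) -> Union[bool, str]:
    """Hypothetical helper for the interim design in this commit."""
    ca_bundle = env.get("UNSTRUCTURED_CLIENT_TLS_CA_BUNDLE")
    if ca_bundle:
        # An explicit trust store wins over the verify flag.
        return ca_bundle
    flag = env.get("UNSTRUCTURED_CLIENT_TLS_VERIFY")
    if flag is not None and flag.strip().lower() in FALSY_VALUES:
        return False  # dev-only: disables server cert verification
    return True
```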

Resolved config flows into both the existing batch_async_start DEBUG
log and the plan_created INFO log as `tls=<verify-mode> <cert-mode>`,
using human-readable descriptors that don't leak filesystem paths.

Tests cover: defaults, CA bundle path, verify-false (parameterized
over case/synonyms), verify-true (parameterized over truthy values),
CA-bundle-wins-over-verify, client cert alone, cert+key split, and a
mocked-AsyncClient end-to-end check that the resolved config reaches
the httpx.AsyncClient kwargs. 108/108 split_pdf_hook tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collaborator

@awalker4 awalker4 left a comment

LGTM! The repo is no longer auto-generated, so we're empowered to own the changelog and bump the version before doing a github release.

fxdgear and others added 2 commits June 5, 2026 14:28
Address Trahey's review feedback:

1. Drop the word "Client" from the trust-store env vars. In TLS,
   "client" specifically means mTLS client authentication, so a name
   like UNSTRUCTURED_CLIENT_TLS_VERIFY is ambiguous when it's really
   about server-cert verification.

2. Honor the standard env vars other libraries already respect
   (SSL_CERT_FILE first, then REQUESTS_CA_BUNDLE). A single env-var
   setting now applies uniformly across Python tooling.

The mTLS client-auth env vars (UNSTRUCTURED_CLIENT_TLS_CLIENT_CERT /
_CLIENT_KEY) keep their names — the word "CLIENT" there refers to TLS
client authentication, which is its correct usage.

The disable-verify knob is removed entirely; the standard env vars
have no such escape hatch by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per Austin's review note: the repo no longer auto-generates, so this PR
owns its version bump and changelog entry directly.

- src/unstructured_client/_version.py: 0.44.1 -> 0.45.0
- CHANGELOG.md: 0.45.0 Features section covering pool limits, trust
  store (SSL_CERT_FILE / REQUESTS_CA_BUNDLE), mTLS client cert, and
  the extended plan_created observability log
- RELEASES.md: append matching Speakeasy-style v0.45.0 entry

Minor bump because all additions are new optional env vars with
defaults that match httpx (fully backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fxdgear fxdgear changed the title feat: env-var configurable httpx pool config in split_pdf_hook feat: env-var configurable httpx pool + TLS in split_pdf_hook (0.45.0) Jun 5, 2026
@fxdgear fxdgear merged commit 7ab4de9 into main Jun 5, 2026
16 checks passed
@fxdgear fxdgear deleted the feat/httpx-pool-env-config branch June 5, 2026 21:30
3 participants