Skip to content

Prevent deadlock with urllib3 logging during shutdown + add bounded f…#37

Open
oesi333 wants to merge 1 commit intologtail:masterfrom
oesi333:fix/flush-deadlock-with-urllib3-debug-logging
Open

Prevent deadlock with urllib3 logging during shutdown + add bounded f…#37
oesi333 wants to merge 1 commit intologtail:masterfrom
oesi333:fix/flush-deadlock-with-urllib3-debug-logging

Conversation

@oesi333
Copy link
Copy Markdown

@oesi333 oesi333 commented Apr 29, 2026

Problem

LogtailHandler.flush() can deadlock the host process when called during logging.shutdown() / logging.config.dictConfig(). This is reachable in real-world setups - e.g. anyone using pymilvus (which calls dictConfig() at module import time) alongside logtail-python. We hit it deterministically when BetterStack was slow.

Stack trace from py-spy dump on the deadlocked process

MainThread:
flush (logtail/flusher.py:83)
flush (logtail/handler.py:81)
shutdown (logging/init.py:2182)
_clearExistingHandlers (logging/config.py:275)
configure (logging/config.py:544)
dictConfig (logging/config.py:811)
init_log (pymilvus/settings.py:99)
(pymilvus/settings.py:102)
...

Thread-1 (FlushWorker):
_acquireLock (logging/init.py:226)
isEnabledFor (logging/init.py:1734)
debug (logging/init.py:1464)
_make_request (urllib3/connectionpool.py:556)
urlopen (urllib3/connectionpool.py:803)
send (requests/adapters.py:658)
...
call (logtail/uploader.py:24)
step (logtail/flusher.py:67)
run (logtail/flusher.py:27)

MainThread holds logging._lock (acquired inside _clearExistingHandlers) while waiting for the FlushWorker to drain. The FlushWorker is in urllib3._make_request, which calls logger.debug(...) - also needing logging._lock. Classic A-waits-B, B-waits-A.

Fix

Two complementary changes:

  1. Quiet urllib3 to WARNING when logtail/handler.py is imported (opt-out via LOGTAIL_KEEP_URLLIB3_LOGS=1). This makes urllib3's per-request debug() calls cheap no-ops, so the FlushWorker no longer needs logging._lock during uploads. Existing higher-than-WARNING urllib3 levels are left untouched.

  2. Bound LogtailHandler.flush() with a flush_timeout (default 30s; pass None for legacy unbounded). FlushWorker.flush() now accepts timeout and returns True/False on drained/timed-out. This prevents shutdown from hanging the host process if BetterStack is slow/unreachable, and bounds any future similar deadlock window.

Tests

Five new unit tests, all 31 non-hanging tests pass. (The pre-existing test_can_send_* tests in test_handler.py hang on master too — separate issue.)

Backwards compatibility

  • LogtailHandler(...) — new optional kwarg flush_timeout=30. Existing callers unaffected.
  • FlushWorker.flush() — new optional timeout kwarg, defaults to None (legacy unbounded). Callers passing no args get the same behavior as before.
  • urllib3 quieting — opt-out env var preserves original behavior for users who depend on urllib3 DEBUG logs.

…lush timeout

When LogtailHandler.flush() is called while another thread holds Python's
global logging lock — most reproducibly during logging.shutdown() or
logging.config.dictConfig() (e.g. triggered at import time by libraries
like pymilvus) — the FlushWorker can deadlock against the calling thread.

Cause: the FlushWorker uses requests/urllib3 to POST log frames; urllib3's
connection-pool code calls logger.debug() from inside _make_request(),
which requires logging._lock. The thread invoking shutdown holds that
same lock while iterating handlers and calling flush(). If our flush()
blocks waiting for the FlushWorker, both threads are stuck.

Two complementary fixes:

1. Quiet urllib3 to WARNING when handler.py is imported, with an
   opt-out via LOGTAIL_KEEP_URLLIB3_LOGS=1. This makes urllib3's
   per-request debug() calls cheap no-ops, so the FlushWorker no
   longer needs the global logging lock during uploads. Existing
   urllib3 levels above WARNING are left untouched.

2. Bound LogtailHandler.flush() with a timeout (default 30s, configurable
   via the new flush_timeout constructor argument; pass None to keep the
   legacy unbounded wait). FlushWorker.flush() now accepts an optional
   timeout and returns True/False indicating whether the queue was fully
   drained. Shutdown will no longer hang the host process indefinitely
   when BetterStack is slow or unreachable, and the deadlock window from
   any future similar issue is bounded.

Tests: five new unit tests covering both code paths; existing tests pass
unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant