Prevent deadlock with urllib3 logging during shutdown + add bounded f…#37
Open
oesi333 wants to merge 1 commit intologtail:masterfrom
Open
Prevent deadlock with urllib3 logging during shutdown + add bounded f…#37oesi333 wants to merge 1 commit intologtail:masterfrom
oesi333 wants to merge 1 commit intologtail:masterfrom
Conversation
…lush timeout When LogtailHandler.flush() is called while another thread holds Python's global logging lock — most reproducibly during logging.shutdown() or logging.config.dictConfig() (e.g. triggered at import time by libraries like pymilvus) — the FlushWorker can deadlock against the calling thread. Cause: the FlushWorker uses requests/urllib3 to POST log frames; urllib3's connection-pool code calls logger.debug() from inside _make_request(), which requires logging._lock. The thread invoking shutdown holds that same lock while iterating handlers and calling flush(). If our flush() blocks waiting for the FlushWorker, both threads are stuck. Two complementary fixes: 1. Quiet urllib3 to WARNING when handler.py is imported, with an opt-out via LOGTAIL_KEEP_URLLIB3_LOGS=1. This makes urllib3's per-request debug() calls cheap no-ops, so the FlushWorker no longer needs the global logging lock during uploads. Existing urllib3 levels above WARNING are left untouched. 2. Bound LogtailHandler.flush() with a timeout (default 30s, configurable via the new flush_timeout constructor argument; pass None to keep the legacy unbounded wait). FlushWorker.flush() now accepts an optional timeout and returns True/False indicating whether the queue was fully drained. Shutdown will no longer hang the host process indefinitely when BetterStack is slow or unreachable, and the deadlock window from any future similar issue is bounded. Tests: five new unit tests covering both code paths; existing tests pass unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
LogtailHandler.flush()can deadlock the host process when called duringlogging.shutdown()/logging.config.dictConfig(). This is reachable in real-world setups - e.g. anyone usingpymilvus(which callsdictConfig()at module import time) alongsidelogtail-python. We hit it deterministically when BetterStack was slow.Stack trace from
py-spy dumpon the deadlocked processMainThread:
flush (logtail/flusher.py:83)
flush (logtail/handler.py:81)
shutdown (logging/init.py:2182)
_clearExistingHandlers (logging/config.py:275)
configure (logging/config.py:544)
dictConfig (logging/config.py:811)
init_log (pymilvus/settings.py:99)
(pymilvus/settings.py:102)
...
Thread-1 (FlushWorker):
_acquireLock (logging/init.py:226)
isEnabledFor (logging/init.py:1734)
debug (logging/init.py:1464)
_make_request (urllib3/connectionpool.py:556)
urlopen (urllib3/connectionpool.py:803)
send (requests/adapters.py:658)
...
call (logtail/uploader.py:24)
step (logtail/flusher.py:67)
run (logtail/flusher.py:27)
MainThreadholdslogging._lock(acquired inside_clearExistingHandlers) while waiting for the FlushWorker to drain. The FlushWorker is inurllib3._make_request, which callslogger.debug(...)- also needinglogging._lock. Classic A-waits-B, B-waits-A.Fix
Two complementary changes:
Quiet
urllib3to WARNING whenlogtail/handler.pyis imported (opt-out viaLOGTAIL_KEEP_URLLIB3_LOGS=1). This makes urllib3's per-requestdebug()calls cheap no-ops, so the FlushWorker no longer needslogging._lockduring uploads. Existing higher-than-WARNING urllib3 levels are left untouched.Bound
LogtailHandler.flush()with aflush_timeout(default 30s; passNonefor legacy unbounded).FlushWorker.flush()now acceptstimeoutand returnsTrue/Falseon drained/timed-out. This prevents shutdown from hanging the host process if BetterStack is slow/unreachable, and bounds any future similar deadlock window.Tests
Five new unit tests, all 31 non-hanging tests pass. (The pre-existing
test_can_send_*tests intest_handler.pyhang on master too — separate issue.)Backwards compatibility
LogtailHandler(...)— new optional kwargflush_timeout=30. Existing callers unaffected.FlushWorker.flush()— new optionaltimeoutkwarg, defaults toNone(legacy unbounded). Callers passing no args get the same behavior as before.