Defer log_lines search_vector indexing off the insert path#4818
Conversation
f3f45af to
cae40bc
Compare
Security Review ✅
|
|
@stuartc , fantastic find here. I noticed that the description seemed to reference inserting multiple log lines at once. Wanted to double check: batch log lines insert is turned off on app.openfn.org, right? I think we tried it a few months back and when the crashes started occurring we switched it off. I believe Joe has been tracking a bug related to log line ordering on the bulk log line feature too. (Just double checking that we're all on the same page here.) |
|
@taylordowns2000 similar, but it's not related to batch log lines, with this we update the tsvector in batches (and asynchronously). should be a bit faster and way less stress on the index. And no, you're correct; we don't use batch logging on production right now. I can't remember what the issue was, but worth following up on this - but yeah it's independent of this. There's a complimentary PR #4821 over here for dataclips. Holding both of them as draft just for a sec, the current approach on this relies on a 'snowballing' oban job, which is nice because it disconnects the insertion of loglines and dataclips from indexing (nice). But there is a small downside in that there is a chance that in a quiet system that has something inserted at like 2s after the minute would have to wait for the next minute to roll over to be indexed/searchable. So it's really designed right now for busy systems. I want to evaluate what something a bit more eager would look like before calling this done done. |
The AFTER INSERT trigger on log_lines computed to_tsvector synchronously inside the insert transaction (it never actually deferred anything) and double-wrote every row via a self-UPDATE. It also lacked the program_limit_exceeded guard dataclips got, so an oversized message aborted the insert (rolling back the whole batch for insert_all). Remove the trigger so inserts leave search_vector NULL, and backfill it out-of-band via a new Oban worker (Lightning.LogLines.SearchVectorWorker) on a dedicated search_indexing queue. Adds a guarded safe_to_tsvector SQL function and a partial index (WHERE search_vector IS NULL) so draining pending rows stays cheap. Covers both the single (run:log) and batch (run:batch_logs) insert paths. Live log streaming is unaffected (push-based via PubSub); only full-text log search lags slightly behind ingestion.
Follow-up hardening on the search_vector deferral, from a review pass: - Fix the worker's snowball chain. Oban's default unique states include :executing and :completed, so a running snowball job matched itself when enqueueing its successor and the insert was silently deduped, breaking the chain after one hop. Restrict uniqueness to [:available, :scheduled]. - Make the pending-search index migration re-runnable: a failed CREATE INDEX CONCURRENTLY leaves an INVALID index that IF NOT EXISTS would skip and ATTACH would leave the parent invalid forever. Drop any invalid leftover first. - Make safe_to_tsvector NULL-defensive and re-runnable (CREATE OR REPLACE, drop STRICT, COALESCE the doc) so it never returns NULL and leaves a row stuck in the pending set. Also the template for the dataclip fix. - Add snowball uniqueness regression tests.
Trim hindsight/diff-narrating comments down to what's non-obvious, and rewrite the SearchVectorWorker moduledoc to read as documentation of the mechanism rather than a justification of the change.
Deferring log_lines.search_vector indexing off the insert path means inserted log lines have a NULL search_vector until SearchVectorWorker drains them. Tests that insert log lines and then search them on the log field matched nothing, since the worker never runs on its own in the test environment. Add Lightning.TestUtils.flush_log_search_index/0, which runs the worker synchronously in-process via Oban.Testing.perform_job/3 so it indexes the uncommitted sandbox rows, and call it in the two invocation_test setups that insert log lines before searching. Also add a positive log-match assertion to the stemming test so a regression that re-NULLs the vector fails loudly rather than passing on an empty result.
cae40bc to
7b7f49a
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4818 +/- ##
=======================================
- Coverage 90.3% 90.3% -0.0%
=======================================
Files 442 443 +1
Lines 22547 22562 +15
=======================================
+ Hits 20366 20375 +9
- Misses 2181 2187 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Make the SearchVectorWorker batch_size/max_batches configurable through the Lightning.Config seam (defaults 2500/10 in config.exs, 2/2 in test.exs) so a test can drive the budget-exhaustion/snowball path with a handful of rows. Restructure drain/2 into drain/4 since the max_batches guard can no longer be a compile-time literal, and add a test that exercises the recursive drain, budget guard, and snowball enqueue.
Description
This changes how
log_linesfull-text search gets indexed — taking it off thelog-insert hot path to cut down the
run:logchannel timeouts we've been seeingunder heavy log volume.
The old
AFTER INSERTtrigger computedto_tsvectorsynchronously inside theinsert transaction (so it never actually deferred anything) and double-wrote
every row via a self-
UPDATE. It also had noprogram_limit_exceededguard, soa single oversized message would abort the insert and roll back the whole batch
on the
insert_allpath.Now inserts leave
search_vectorNULL and a background Oban worker(
Lightning.LogLines.SearchVectorWorker) backfills it out-of-band on a dedicatedsearch_indexingqueue. There's a guardedsafe_to_tsvectorSQL function and apartial index (
WHERE search_vector IS NULL) so finding pending rows stayscheap. Live log streaming is unaffected — it's push-based over PubSub and never
reads
search_vector; only full-text log search now lags ingestion slightly,typically under a minute.
Three migrations, ordered 1a→1b→1c: add
safe_to_tsvector, add the partial index(built per-partition
CONCURRENTLYthen attached to the parent), then drop thetrigger.
Closes #4425
Caveat
This is one thing about this approach worth mentioning:
In the situation of an idle instace, with a few logs being inserted.
Indexing is poll-driven by the 1-minute cron - a lone insert can't enqueue the worker itself.
So with ~zero load, a new row waits for the next cron tick: up to ~60s, ~30s on average.
This is a floor set by the cron cadence, not by load.
Validation steps
mix ecto.migrate). Confirm thelog_linesset_search_vectortrigger is gone,safe_to_tsvectorexists, and thepartial index is VALID across the parent + all 100 partitions. The
dataclipstrigger should be untouched.search_vectoris NULL and it isn't matched by logsearch yet.
Lightning.LogLines.SearchVectorWorker— the row'ssearch_vectorispopulated and now matches
to_tsquery('english_nostop', …).erroring, and doesn't get stuck retrying.
mix test test/lightning/log_lines/search_vector_worker_test.exs test/lightning/runs_test.exsAdditional notes for the reviewer
there's backlog, falling back to a 1-minute cron heartbeat. Concurrency is 1
on the dedicated queue (SKIP LOCKED makes bumping it safe later).
uniquestates arerestricted to
[:available, :scheduled]on purpose — the default includes:executing/:completed, which makes a running job dedup its own successorand kills the chain — and the index migration drops any INVALID leftover
before re-creating, so a failed
CONCURRENTLYbuild can be re-run.safe_to_tsvectoris also meant as the template for the dataclip trigger fix(Dataclip insert times out building the search vector, which loses the run #4800), which is separate.
AI Usage
Pre-submission checklist