Defer log_lines search_vector indexing off the insert path by stuartc · Pull Request #4818 · OpenFn/lightning

stuartc · 2026-05-30T18:02:06Z

Description

This changes how log_lines full-text search gets indexed — taking it off the
log-insert hot path to cut down the run:log channel timeouts we've been seeing
under heavy log volume.

The old AFTER INSERT trigger computed to_tsvector synchronously inside the
insert transaction (so it never actually deferred anything) and double-wrote
every row via a self-UPDATE. It also had no program_limit_exceeded guard, so
a single oversized message would abort the insert and roll back the whole batch
on the insert_all path.

Now inserts leave search_vector NULL and a background Oban worker
(Lightning.LogLines.SearchVectorWorker) backfills it out-of-band on a dedicated
search_indexing queue. There's a guarded safe_to_tsvector SQL function and a
partial index (WHERE search_vector IS NULL) so finding pending rows stays
cheap. Live log streaming is unaffected — it's push-based over PubSub and never
reads search_vector; only full-text log search now lags ingestion slightly,
typically under a minute.

Three migrations, ordered 1a→1b→1c: add safe_to_tsvector, add the partial index
(built per-partition CONCURRENTLY then attached to the parent), then drop the
trigger.

Closes #4425

Caveat

One thing about this approach is worth mentioning:

On an idle instance with only a few logs being inserted, indexing is poll-driven by the 1-minute cron; a lone insert can't enqueue the worker itself.
So with ~zero load, a new row waits for the next cron tick: up to ~60s, ~30s on average.
This is a floor set by the cron cadence, not by load.
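
The ~60s / ~30s figures can be checked with a small sketch. It is written in Python purely for illustration (the project itself is Elixir). It assumes the cron fires exactly on each 60-second boundary and that the row is indexed at that tick. `wait_for_next_tick` is a hypothetical helper, not part of Lightning.

```python
CRON_PERIOD_S = 60  # the 1-minute cron cadence described above


def wait_for_next_tick(insert_offset_s: float, period_s: float = CRON_PERIOD_S) -> float:
    """Seconds a row inserted `insert_offset_s` after a tick waits for the next tick."""
    if not 0 <= insert_offset_s < period_s:
        raise ValueError("offset must fall within one cron period")
    return period_s - insert_offset_s


# Inserts spread evenly across one period, sampled at each half-second midpoint.
offsets = [i + 0.5 for i in range(CRON_PERIOD_S)]
waits = [wait_for_next_tick(offset) for offset in offsets]
mean_wait = sum(waits) / len(waits)  # average wait under zero load
```

A row inserted right on a tick waits the full period, and the mean over evenly spread inserts is half the period, which is where the "up to ~60s, ~30s on average" floor comes from.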

Validation steps

1. Run the migrations (mix ecto.migrate). Confirm the log_lines set_search_vector trigger is gone, safe_to_tsvector exists, and the partial index is VALID across the parent + all 100 partitions. The dataclips trigger should be untouched.
2. Insert a log line: its search_vector is NULL and it isn't matched by log search yet.
3. Run Lightning.LogLines.SearchVectorWorker: the row's search_vector is populated and now matches to_tsquery('english_nostop', …).
4. Insert an oversized (>1MB) message: it drains to an empty vector rather than erroring, and doesn't get stuck retrying.
5. mix test test/lightning/log_lines/search_vector_worker_test.exs test/lightning/runs_test.exs

Additional notes for the reviewer

The worker drains newest-first and "snowballs" an immediate follow-up while
there's backlog, falling back to a 1-minute cron heartbeat. Concurrency is 1
on the dedicated queue (SKIP LOCKED makes bumping it safe later).
Two subtle bits worth a close look: the snowball's Oban unique states are
restricted to [:available, :scheduled] on purpose — the default includes
:executing/:completed, which makes a running job dedup its own successor
and kills the chain — and the index migration drops any INVALID leftover
before re-creating, so a failed CONCURRENTLY build can be re-run.
safe_to_tsvector is also meant as the template for the dataclip trigger fix
(#4800, "Dataclip insert times out building the search vector, which loses the run"), which is separate.
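
The snowball uniqueness issue noted above can be illustrated with a toy model. This is a Python sketch for illustration only, not Oban's implementation. It ignores the uniqueness period and the worker/args/queue matching that real Oban applies, and `ToyQueue`, `run_chain`, and `DEFAULT_LIKE_STATES` are hypothetical names. Per the description, the default state list includes executing and completed; the rest of the toy default set is an assumption.

```python
from dataclasses import dataclass, field

# Toy stand-in for the default unique states. The description only confirms
# that the default includes executing and completed; the rest is assumed.
DEFAULT_LIKE_STATES = frozenset(
    {"available", "scheduled", "executing", "retryable", "completed"}
)
RESTRICTED_STATES = frozenset({"available", "scheduled"})


@dataclass
class ToyQueue:
    unique_states: frozenset
    jobs: list = field(default_factory=list)  # each job is a dict with a "state"

    def insert(self):
        """Enqueue a job unless an existing job sits in a state covered by uniqueness."""
        if any(job["state"] in self.unique_states for job in self.jobs):
            return None  # deduplicated: nothing new is enqueued
        job = {"state": "available"}
        self.jobs.append(job)
        return job


def run_chain(unique_states, max_jobs):
    """Count how many jobs run when each running job enqueues its own successor."""
    queue = ToyQueue(frozenset(unique_states))
    job = queue.insert()
    jobs_run = 0
    while job is not None and jobs_run < max_jobs:
        job["state"] = "executing"
        successor = queue.insert()  # the snowball, enqueued while still running
        job["state"] = "completed"
        jobs_run += 1
        job = successor
    return jobs_run
```

With the default-like states the running job matches itself and the chain stops after a single job. With the restricted states each job can enqueue a successor, while a successor that is already pending still dedups a second enqueue.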

AI Usage

I have used Claude Code
I have used another model
I have not used AI

Pre-submission checklist

I have performed an AI review of my code
I have implemented and tested all related authorization policies.
I have updated the changelog.
I have ticked a box in "AI usage" in this PR

github-actions · 2026-05-30T18:34:15Z

Security Review ✅

S0 (project scoping): N/A — the new SearchVectorWorker is system-triggered (Oban cron) and rewrites log_lines.search_vector in place via raw SQL with no user input or output path, so there is no query or endpoint where project scoping could apply (lib/lightning/log_lines/search_vector_worker.ex:42-69).
S1 (authorization): N/A — no new controllers, LiveView events, or channel handlers; the only entrypoints are Oban.Worker.perform/1 and the cron schedule in lib/lightning/config/bootstrap.ex:273.
S2 (audit trail): N/A — the worker only backfills a derived search_vector column and the migrations adjust a DB function, an index, and a trigger; none of these are project/instance configuration resources covered by the existing audit modules.

taylordowns2000 · 2026-05-31T09:51:00Z

@stuartc, fantastic find here. I noticed that the description seemed to reference inserting multiple log lines at once. Wanted to double check: batch log lines insert is turned off on app.openfn.org, right?

I think we tried it a few months back and when the crashes started occurring we switched it off. I believe Joe has been tracking a bug related to log line ordering on the bulk log line feature too.

(Just double checking that we're all on the same page here.)

stuartc · 2026-06-01T11:33:37Z

@taylordowns2000 similar, but it's not related to batch log lines. With this we update the tsvector in batches (and asynchronously). It should be a bit faster and put way less stress on the index.

And no, you're correct; we don't use batch logging on production right now. I can't remember what the issue was, and it's worth following up on, but yeah, it's independent of this.

There's a complementary PR, #4821, for dataclips. I'm holding both of them as draft just for a sec. The current approach relies on a 'snowballing' Oban job, which is nice because it decouples the insertion of log lines and dataclips from indexing. The small downside is that in a quiet system, something inserted at, say, 2s after the minute has to wait for the next minute to roll over before it is indexed/searchable. So right now it's really designed for busy systems. I want to evaluate what something a bit more eager would look like before calling this done done.

The AFTER INSERT trigger on log_lines computed to_tsvector synchronously inside the insert transaction (it never actually deferred anything) and double-wrote every row via a self-UPDATE. It also lacked the program_limit_exceeded guard dataclips got, so an oversized message aborted the insert (rolling back the whole batch for insert_all). Remove the trigger so inserts leave search_vector NULL, and backfill it out-of-band via a new Oban worker (Lightning.LogLines.SearchVectorWorker) on a dedicated search_indexing queue. Adds a guarded safe_to_tsvector SQL function and a partial index (WHERE search_vector IS NULL) so draining pending rows stays cheap. Covers both the single (run:log) and batch (run:batch_logs) insert paths. Live log streaming is unaffected (push-based via PubSub); only full-text log search lags slightly behind ingestion.

Follow-up hardening on the search_vector deferral, from a review pass:
- Fix the worker's snowball chain. Oban's default unique states include :executing and :completed, so a running snowball job matched itself when enqueueing its successor and the insert was silently deduped, breaking the chain after one hop. Restrict uniqueness to [:available, :scheduled].
- Make the pending-search index migration re-runnable: a failed CREATE INDEX CONCURRENTLY leaves an INVALID index that IF NOT EXISTS would skip and ATTACH would leave the parent invalid forever. Drop any invalid leftover first.
- Make safe_to_tsvector NULL-defensive and re-runnable (CREATE OR REPLACE, drop STRICT, COALESCE the doc) so it never returns NULL and leaves a row stuck in the pending set. Also the template for the dataclip fix.
- Add snowball uniqueness regression tests.
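
The NULL-defensive, guarded behaviour described for safe_to_tsvector can be modelled with a toy sketch. This is Python for illustration only, not the SQL function from the migration. `toy_to_tsvector` and `toy_safe_to_tsvector` are hypothetical names, and the toy measures the input text, whereas PostgreSQL's limit applies to the resulting vector.

```python
class ProgramLimitExceeded(Exception):
    """Stand-in for PostgreSQL's program_limit_exceeded error class."""


# PostgreSQL documents that a tsvector must be smaller than 1 megabyte.
TOY_LIMIT_BYTES = 1_048_575


def toy_to_tsvector(text: str) -> frozenset:
    # Simplification: check the input size instead of the built vector's size.
    if len(text.encode("utf-8")) > TOY_LIMIT_BYTES:
        raise ProgramLimitExceeded("string is too long for tsvector")
    return frozenset(text.lower().split())


def toy_safe_to_tsvector(doc) -> frozenset:
    """Never raises on oversized input and never returns None."""
    try:
        # Mirrors the COALESCE: a missing document is treated as empty text.
        return toy_to_tsvector(doc if doc is not None else "")
    except ProgramLimitExceeded:
        # An empty vector is still non-NULL, so the row leaves the pending set.
        return frozenset()
```

Because every input maps to a non-None value, a row indexed this way can never be picked up again by a "search_vector IS NULL" scan, which is the property the hardening commit is after.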

Trim hindsight/diff-narrating comments down to what's non-obvious, and rewrite the SearchVectorWorker moduledoc to read as documentation of the mechanism rather than a justification of the change.

Deferring log_lines.search_vector indexing off the insert path means inserted log lines have a NULL search_vector until SearchVectorWorker drains them. Tests that insert log lines and then search them on the log field matched nothing, since the worker never runs on its own in the test environment. Add Lightning.TestUtils.flush_log_search_index/0, which runs the worker synchronously in-process via Oban.Testing.perform_job/3 so it indexes the uncommitted sandbox rows, and call it in the two invocation_test setups that insert log lines before searching. Also add a positive log-match assertion to the stemming test so a regression that re-NULLs the vector fails loudly rather than passing on an empty result.

codecov · 2026-06-02T15:04:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.3%. Comparing base (966454e) to head (a018174).

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #4818     +/-   ##
=======================================
- Coverage   90.3%   90.3%   -0.0%     
=======================================
  Files        442     443      +1     
  Lines      22547   22562     +15     
=======================================
+ Hits       20366   20375      +9     
- Misses      2181    2187      +6


midigofrank

Nicely done dude

Make the SearchVectorWorker batch_size/max_batches configurable through the Lightning.Config seam (defaults 2500/10 in config.exs, 2/2 in test.exs) so a test can drive the budget-exhaustion/snowball path with a handful of rows. Restructure drain/2 into drain/4 since the max_batches guard can no longer be a compile-time literal, and add a test that exercises the recursive drain, budget guard, and snowball enqueue.
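
The budget and snowball interaction in this commit can be sketched as follows. This is a Python toy for illustration, using a loop where the commit describes a recursive drain/4. `drain` here is a hypothetical function, and the rule "snowball only if rows remain after the budget is spent" is an assumption about the worker's behaviour.

```python
def drain(pending, batch_size, max_batches):
    """Index up to `max_batches` batches, newest-first.

    Returns (indexed_count, snowball_needed). `pending` is mutated in place
    and is assumed to be ordered oldest to newest.
    """
    indexed = 0
    batches_run = 0
    while pending and batches_run < max_batches:
        take = min(batch_size, len(pending))
        batch = [pending.pop() for _ in range(take)]  # pop from the end: newest first
        indexed += len(batch)
        batches_run += 1
    # Assumption: a follow-up job is only worthwhile while backlog remains.
    return indexed, bool(pending)
```

With the test-environment values mentioned above (batch size 2, at most 2 batches), five pending rows are enough to exhaust the budget and leave one row for a follow-up job, which is why a handful of rows can exercise the snowball path.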

github-project-automation Bot added this to Core May 30, 2026

github-project-automation Bot moved this to New Issues in Core May 30, 2026

stuartc force-pushed the defer-log-lines-search-vector branch from f3f45af to cae40bc Compare May 30, 2026 18:32

stuartc marked this pull request as ready for review May 30, 2026 18:32

stuartc mentioned this pull request Jun 1, 2026

Defer dataclip search_vector indexing off the insert path #4821

Merged

7 tasks

stuartc marked this pull request as draft June 1, 2026 11:24

stuartc mentioned this pull request Jun 2, 2026

Resolve [run:log] timeouts OpenFn/kit#1430

Closed

stuartc requested a review from midigofrank June 2, 2026 13:16

stuartc marked this pull request as ready for review June 2, 2026 13:16

stuartc added 4 commits June 2, 2026 16:55

Refine comments and moduledoc for deferred log_lines indexing

c98529a


stuartc force-pushed the defer-log-lines-search-vector branch from cae40bc to 7b7f49a Compare June 2, 2026 14:55

midigofrank approved these changes Jun 3, 2026


stuartc merged commit 9b000a0 into main Jun 4, 2026
7 checks passed

stuartc deleted the defer-log-lines-search-vector branch June 4, 2026 05:32

github-project-automation Bot moved this from New Issues to Done in Core Jun 4, 2026

stuartc merged 5 commits into main from defer-log-lines-search-vector