feat: Update embedding logic to bulk #1037
Conversation
Pull request overview
This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.
Changes:
- Introduces a `MarkdownChunk` model and updates chunking/output/tests to use `Heading` + `ChunkText`.
- Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
- Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| EssentialCSharp.Chat/Program.cs | Updates chunk stats/output to use MarkdownChunk fields. |
| EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs | Updates assertions for the new chunk model (ChunkText). |
| EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs | Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances. |
| EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs | Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly. |
| EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs | Implements batch embedding + staging-then-swap upload strategy using Npgsql. |
| EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs | Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex. |
| EssentialCSharp.Chat.Shared/Services/AISearchService.cs | Adds heading-based deduplication of vector search results. |
| EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs | Adds ChunkIndex as stored metadata for chunks. |
```csharp
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;
}
```
The embedding generator call batches all chunks into a single request, but Azure OpenAI embeddings have an input-count limit (noted in the comment as 2048). If the book produces more than that many chunks, this will fail at runtime. Consider chunking texts into batches (<= provider limit) and merging the returned vectors back into chunkList (or throw a clear exception when the limit is exceeded).
Suggested change:

```csharp
// ── Step 2: Batch-embed all chunks in provider-safe API calls ─────────────────
// Azure OpenAI embeddings impose an input-count limit per request.
// Process the texts in batches and merge the returned vectors back into the
// original chunk list to preserve ordering.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

const int maxEmbeddingBatchSize = 2048;
for (int batchStart = 0; batchStart < texts.Count; batchStart += maxEmbeddingBatchSize)
{
    int batchSize = Math.Min(maxEmbeddingBatchSize, texts.Count - batchStart);
    List<string> batchTexts = texts.GetRange(batchStart, batchSize);

    GeneratedEmbeddings<Embedding<float>> embeddings =
        await embeddingGenerator.GenerateAsync(batchTexts, cancellationToken: cancellationToken);

    if (embeddings.Count != batchSize)
        throw new InvalidOperationException(
            $"Embedding count mismatch for batch starting at index {batchStart}: expected {batchSize}, got {embeddings.Count}.");

    for (int i = 0; i < batchSize; i++)
    {
        chunkList[batchStart + i].TextEmbedding = embeddings[i].Vector;
    }
}
```
```csharp
// Drop any leftover backup from a previous run
cmd.CommandText = $"DROP TABLE IF EXISTS \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename live → old. IF EXISTS is a no-op on first run when no live table exists.
// Using ALTER TABLE IF EXISTS avoids PL/pgSQL string interpolation entirely.
cmd.CommandText = $"ALTER TABLE IF EXISTS \"{collectionName}\" RENAME TO \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename staging → live
cmd.CommandText = $"ALTER TABLE \"{stagingName}\" RENAME TO \"{collectionName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);
```
These SQL statements interpolate collectionName/derived names directly into identifier-quoted SQL. If collectionName can be influenced outside trusted code, this becomes identifier-injection (quotes can be escaped/broken). Consider restricting collectionName to a safe identifier regex (e.g., letters/digits/underscore) before composing SQL, and use Npgsql's identifier-quoting helper to build the final identifiers consistently.
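A sketch of what that guard could look like; the helper name and regex are illustrative, not code from this PR (Npgsql's `NpgsqlCommandBuilder.QuoteIdentifier` handles the double-quote escaping):

```csharp
using System;
using System.Text.RegularExpressions;
using Npgsql;

static class PgIdentifier
{
    // Hypothetical guard: accept only letters/digits/underscore, then quote
    // consistently so even a validated name cannot break out of the identifier.
    public static string QuoteSafe(string name)
    {
        if (!Regex.IsMatch(name, @"^[A-Za-z_][A-Za-z0-9_]*$"))
            throw new ArgumentException($"Unsafe identifier: '{name}'.", nameof(name));
        return new NpgsqlCommandBuilder().QuoteIdentifier(name);
    }
}

// Usage in the swap, replacing the raw interpolation:
// cmd.CommandText = $"DROP TABLE IF EXISTS {PgIdentifier.QuoteSafe(oldName)}";
```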
```csharp
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
///    PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
///    LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.
```
The method documentation says the swap uses "three SQL RENAMEs", but the implementation performs two RENAME operations (live→old, staging→live) plus DROP TABLE statements. Update the comments so they accurately describe the actual DDL being executed.
Suggested change:

```csharp
/// 4. Atomically swap tables in a single transaction using two SQL RENAME operations
///    (live → old, staging → live). PostgreSQL ALTER TABLE acquires
///    AccessExclusiveLock automatically; no explicit LOCK TABLE is needed. The
///    transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table with DROP TABLE.
```
```diff
 public async Task GenerateBookContentEmbeddingsAndUploadToVectorStore(
     IEnumerable<BookContentChunk> bookContents,
     CancellationToken cancellationToken,
     string? collectionName = null)
 {
     collectionName ??= CollectionName;
+    string stagingName = $"{collectionName}_staging";
+    string oldName = $"{collectionName}_old";

-    var collection = vectorStore.GetCollection<string, BookContentChunk>(collectionName);
-    await collection.EnsureCollectionDeletedAsync(cancellationToken);
-    await collection.EnsureCollectionExistsAsync(cancellationToken);
+    if (dataSource is null)
+        throw new InvalidOperationException("NpgsqlDataSource is required for the staging swap. Ensure it is registered in DI.");

-    int uploadedCount = 0;
+    // ── Step 1: Prepare staging collection ────────────────────────────────────────
+    var staging = vectorStore.GetCollection<string, BookContentChunk>(stagingName);
+    await staging.EnsureCollectionDeletedAsync(cancellationToken);
+    await staging.EnsureCollectionExistsAsync(cancellationToken);

-    foreach (var chunk in bookContents)
+    // ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
+    // IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
+    // The single-string overload used previously is a convenience extension method
+    // that wraps one item and calls this same method.
+    var chunkList = bookContents.ToList();
+    var texts = chunkList.Select(c => c.ChunkText).ToList();
+
+    GeneratedEmbeddings<Embedding<float>> embeddings =
+        await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);
+
+    if (embeddings.Count != chunkList.Count)
+        throw new InvalidOperationException(
+            $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");
+
+    for (int i = 0; i < chunkList.Count; i++)
+    {
+        chunkList[i].TextEmbedding = embeddings[i].Vector;
+    }
+
+    // ── Step 3: Batch-upsert all chunks into staging ──────────────────────────────
+    await staging.UpsertAsync(chunkList, cancellationToken);
+    Console.WriteLine($"Uploaded {chunkList.Count} chunks to staging collection '{stagingName}'.");
```
GenerateBookContentEmbeddingsAndUploadToVectorStore now contains significant new behavior (batch embedding + staging/swap workflow) but there are no unit/integration tests covering it in the Chat test suite. Adding tests that at least verify batching semantics (e.g., embedding generator called with N inputs and staging collection UpsertAsync invoked once with embeddings assigned) would help prevent regressions; the swap step can be isolated behind an interface to make it testable without a live database.
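For the batching half, something along these lines could work without a live database; `EmbeddingBatcher.EmbedInBatchesAsync` is a hypothetical helper (the batch loop extracted from the upload method), and the fake assumes the Microsoft.Extensions.AI 9.x interface shape:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Xunit;

// Fake generator that records how many inputs each GenerateAsync call receives.
sealed class CountingEmbeddingGenerator : IEmbeddingGenerator<string, Embedding<float>>
{
    public List<int> BatchSizes { get; } = [];

    public Task<GeneratedEmbeddings<Embedding<float>>> GenerateAsync(
        IEnumerable<string> values,
        EmbeddingGenerationOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var inputs = values.ToList();
        BatchSizes.Add(inputs.Count);
        var embeddings = inputs.Select(_ => new Embedding<float>(new[] { 0f }));
        return Task.FromResult(new GeneratedEmbeddings<Embedding<float>>(embeddings));
    }

    public object? GetService(Type serviceType, object? serviceKey = null) => null;
    public void Dispose() { }
}

public class EmbeddingBatchingTests
{
    [Fact]
    public async Task EmbedInBatches_SplitsAtProviderLimitAndPreservesCount()
    {
        var generator = new CountingEmbeddingGenerator();
        var texts = Enumerable.Range(0, 5000).Select(i => $"chunk {i}").ToList();

        // Hypothetical helper: the batch loop from the upload method, extracted
        // so it can be exercised without a vector store or NpgsqlDataSource.
        var vectors = await EmbeddingBatcher.EmbedInBatchesAsync(
            generator, texts, maxBatchSize: 2048, CancellationToken.None);

        // 5000 inputs at a limit of 2048 should yield batches of 2048 + 2048 + 904.
        Assert.Equal(new[] { 2048, 2048, 904 }, generator.BatchSizes);
        Assert.Equal(texts.Count, vectors.Count);
    }
}
```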
```csharp
var candidates_list = new List<VectorSearchResult<BookContentChunk>>();
await foreach (var result in collection.SearchAsync(searchVector, options: vectorSearchOptions, top: candidates, cancellationToken: cancellationToken))
{
    results.Add(result);
    candidates_list.Add(result);
}
```
Local variable name candidates_list uses snake_case, which is inconsistent with the surrounding camelCase naming in this method (e.g., vectorSearchOptions, searchVector). Rename to candidatesList (or similar) for consistency and readability.
BenjaminMichaelis left a comment:
Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:
Must fix:
1. `candidates_list` (AISearchService.cs) violates the C# camelCase convention; it should be `candidatesList`.
2. `NpgsqlDataSource?` is nullable/optional in the constructor, but `GenerateBookContentEmbeddingsAndUploadToVectorStore` throws immediately if it is null. This is a DI anti-pattern: either require it in the constructor or split into two classes. As-is, the service compiles and resolves but explodes only when the method is called (see the sketch after this list).
3. The Azure OpenAI batch limit (~2048 inputs) is mentioned in the summary comment but not enforced. A large book could silently exceed this and fail at runtime; consider batching internally.
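For item 2, a minimal sketch of the required-dependency shape (dependency names are taken from the diff; the real constructor likely has more parameters):

```csharp
// Requiring NpgsqlDataSource up front makes a missing registration fail at
// container build/resolution time instead of on the first upload call.
public sealed class EmbeddingService(
    IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator,
    VectorStore vectorStore,
    NpgsqlDataSource dataSource)
{
    // Methods can use dataSource directly, with no null check or throw path.
}
```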
Should fix:
4. SQL RENAME statements use raw string interpolation with collectionName, which is caller-controlled. Even though it currently comes from a constant, consider asserting/validating the name only contains safe characters (e.g., alphanumeric + underscore) to prevent accidental SQL issues.
5. If staging.UpsertAsync(chunkList, cancellationToken) throws, the staging table is left behind. Consider wrapping it in try/catch and deleting staging on failure (one possible shape is sketched below).
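One shape item 5 could take, using the filtered catch the follow-up commit below settles on (a sketch around the existing `staging`/`chunkList` locals):

```csharp
try
{
    await staging.UpsertAsync(chunkList, cancellationToken);
}
catch
{
    try
    {
        // Best-effort: don't leave a half-populated staging table behind.
        await staging.EnsureCollectionDeletedAsync(CancellationToken.None);
    }
    catch (Exception cleanupEx) when (cleanupEx is not OperationCanceledException)
    {
        // Log and continue so the cleanup failure can't mask the original error.
        Console.Error.WriteLine($"Staging cleanup failed: {cleanupEx.Message}");
    }
    throw;
}
```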
Nit:
6. ExtractChapterNumber silently returning null for non-chapter files is a meaningful behavioral change from the previous InvalidOperationException — worth a code comment noting this is intentional and that callers handle null.
- Rename candidates_list → candidatesList (C# camelCase convention)
- Make NpgsqlDataSource required in EmbeddingService constructor (always registered in DI; optional+throw was a misleading anti-pattern)
- Add EmbeddingBatchSize = 2048 constant and batch the GenerateAsync call to respect the Azure OpenAI input limit
- Validate collectionName against a safe identifier regex before SQL use
- Add best-effort staging cleanup on UpsertAsync failure (nested try so a cleanup exception cannot mask the original)
- Document ChapterNumber nullability on the BookContentChunk property and the ToBookContentChunks public method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ment
- Replace `catch { }` with `catch (Exception cleanupEx) when (cleanupEx is
not OperationCanceledException)` + Console.Error.WriteLine so cleanup
failures are visible without masking the original exception
- Correct method summary: swap uses two SQL RENAMEs (live→old, staging→live)
plus DROP TABLE statements, not "three SQL RENAMEs"
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 078bc93 to d25e8b8.