Skip to content

feat: Update embedding logic to bulk#1037

Open
BenjaminMichaelis wants to merge 3 commits intomainfrom
bmichaelis/EmbeddingUpdates
Open

feat: Update embedding logic to bulk#1037
BenjaminMichaelis wants to merge 3 commits intomainfrom
bmichaelis/EmbeddingUpdates

Conversation

@BenjaminMichaelis
Copy link
Copy Markdown
Member

Description

Describe your changes here.

Fixes #Issue_Number (if available)

Ensure that your pull request has followed all the steps below:

  • Code compilation
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Copilot AI review requested due to automatic review settings April 26, 2026 15:21
Comment on lines +69 to +76
foreach (var raw in fileContent)
{
var line = raw.Trim();
var isBlank = string.IsNullOrWhiteSpace(line);
if (!isBlank || !lastWasBlank)
normalizedLines.Add(line);
lastWasBlank = isBlank;
}
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.

Changes:

  • Introduces a MarkdownChunk model and updates chunking/output/tests to use Heading + ChunkText.
  • Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
  • Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
EssentialCSharp.Chat/Program.cs Updates chunk stats/output to use MarkdownChunk fields.
EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs Updates assertions for the new chunk model (ChunkText).
EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances.
EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly.
EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Implements batch embedding + staging-then-swap upload strategy using Npgsql.
EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex.
EssentialCSharp.Chat.Shared/Services/AISearchService.cs Adds heading-based deduplication of vector search results.
EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs Adds ChunkIndex as stored metadata for chunks.

Comment on lines +62 to +78
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
throw new InvalidOperationException(
$"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
chunkList[i].TextEmbedding = embeddings[i].Vector;
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The embedding generator call batches all chunks into a single request, but Azure OpenAI embeddings have an input-count limit (noted in the comment as 2048). If the book produces more than that many chunks, this will fail at runtime. Consider chunking texts into batches (<= provider limit) and merging the returned vectors back into chunkList (or throw a clear exception when the limit is exceeded).

Suggested change
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();
GeneratedEmbeddings<Embedding<float>> embeddings =
await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);
if (embeddings.Count != chunkList.Count)
throw new InvalidOperationException(
$"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");
for (int i = 0; i < chunkList.Count; i++)
{
chunkList[i].TextEmbedding = embeddings[i].Vector;
// ── Step 2: Batch-embed all chunks in provider-safe API calls ─────────────────
// Azure OpenAI embeddings impose an input-count limit per request.
// Process the texts in batches and merge the returned vectors back into the
// original chunk list to preserve ordering.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();
const int maxEmbeddingBatchSize = 2048;
for (int batchStart = 0; batchStart < texts.Count; batchStart += maxEmbeddingBatchSize)
{
int batchSize = Math.Min(maxEmbeddingBatchSize, texts.Count - batchStart);
List<string> batchTexts = texts.GetRange(batchStart, batchSize);
GeneratedEmbeddings<Embedding<float>> embeddings =
await embeddingGenerator.GenerateAsync(batchTexts, cancellationToken: cancellationToken);
if (embeddings.Count != batchSize)
throw new InvalidOperationException(
$"Embedding count mismatch for batch starting at index {batchStart}: expected {batchSize}, got {embeddings.Count}.");
for (int i = 0; i < batchSize; i++)
{
chunkList[batchStart + i].TextEmbedding = embeddings[i].Vector;
}

Copilot uses AI. Check for mistakes.
Comment on lines +96 to +107
// Drop any leftover backup from a previous run
cmd.CommandText = $"DROP TABLE IF EXISTS \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename live → old. IF EXISTS is a no-op on first run when no live table exists.
// Using ALTER TABLE IF EXISTS avoids PL/pgSQL string interpolation entirely.
cmd.CommandText = $"ALTER TABLE IF EXISTS \"{collectionName}\" RENAME TO \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename staging → live
cmd.CommandText = $"ALTER TABLE \"{stagingName}\" RENAME TO \"{collectionName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These SQL statements interpolate collectionName/derived names directly into identifier-quoted SQL. If collectionName can be influenced outside trusted code, this becomes identifier-injection (quotes can be escaped/broken). Consider restricting collectionName to a safe identifier regex (e.g., letters/digits/underscore) before composing SQL, and use Npgsql's identifier-quoting helper to build the final identifiers consistently.

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +40
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
/// PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
/// LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method documentation says the swap uses "three SQL RENAMEs", but the implementation performs two RENAME operations (live→old, staging→live) plus DROP TABLE statements. Update the comments so they accurately describe the actual DDL being executed.

Suggested change
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
/// PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
/// LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.
/// 4. Atomically swap tables in a single transaction using two SQL RENAME operations
/// (live → old, staging → live). PostgreSQL ALTER TABLE acquires
/// AccessExclusiveLock automatically; no explicit LOCK TABLE is needed. The
/// transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table with DROP TABLE.

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +83
public async Task GenerateBookContentEmbeddingsAndUploadToVectorStore(
IEnumerable<BookContentChunk> bookContents,
CancellationToken cancellationToken,
string? collectionName = null)
{
collectionName ??= CollectionName;
string stagingName = $"{collectionName}_staging";
string oldName = $"{collectionName}_old";

var collection = vectorStore.GetCollection<string, BookContentChunk>(collectionName);
await collection.EnsureCollectionDeletedAsync(cancellationToken);
await collection.EnsureCollectionExistsAsync(cancellationToken);
if (dataSource is null)
throw new InvalidOperationException("NpgsqlDataSource is required for the staging swap. Ensure it is registered in DI.");

int uploadedCount = 0;
// ── Step 1: Prepare staging collection ────────────────────────────────────────
var staging = vectorStore.GetCollection<string, BookContentChunk>(stagingName);
await staging.EnsureCollectionDeletedAsync(cancellationToken);
await staging.EnsureCollectionExistsAsync(cancellationToken);

foreach (var chunk in bookContents)
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
throw new InvalidOperationException(
$"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
chunkList[i].TextEmbedding = embeddings[i].Vector;
}

// ── Step 3: Batch-upsert all chunks into staging ──────────────────────────────
await staging.UpsertAsync(chunkList, cancellationToken);
Console.WriteLine($"Uploaded {chunkList.Count} chunks to staging collection '{stagingName}'.");
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenerateBookContentEmbeddingsAndUploadToVectorStore now contains significant new behavior (batch embedding + staging/swap workflow) but there are no unit/integration tests covering it in the Chat test suite. Adding tests that at least verify batching semantics (e.g., embedding generator called with N inputs and staging collection UpsertAsync invoked once with embeddings assigned) would help prevent regressions; the swap step can be isolated behind an interface to make it testable without a live database.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 40 to 44
var candidates_list = new List<VectorSearchResult<BookContentChunk>>();
await foreach (var result in collection.SearchAsync(searchVector, options: vectorSearchOptions, top: candidates, cancellationToken: cancellationToken))
{
results.Add(result);
candidates_list.Add(result);
}
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local variable name candidates_list uses snake_case, which is inconsistent with the surrounding camelCase naming in this method (e.g., vectorSearchOptions, searchVector). Rename to candidatesList (or similar) for consistency and readability.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

@BenjaminMichaelis BenjaminMichaelis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:

Must fix:

  1. candidates_list (AISearchService.cs) violates C# camelCase convention — should be candidatesList.
  2. NpgsqlDataSource? is nullable/optional in the constructor but GenerateBookContentEmbeddingsAndUploadToVectorStore throws immediately if null. This is a DI anti-pattern — either require it in the constructor or split into two classes. As-is, the service compiles and resolves but explodes only when the method is called.
  3. Azure OpenAI batch limit (~2048 inputs) is mentioned in the summary comment but not enforced. A large book could silently exceed this and fail at runtime — consider batching internally.

Should fix:
4. SQL RENAME statements use raw string interpolation with collectionName, which is caller-controlled. Even though it currently comes from a constant, consider asserting/validating the name only contains safe characters (e.g., alphanumeric + underscore) to prevent accidental SQL issues.
5. If staging.UpsertAsync(chunkList, cancellationToken) throws, the staging table is left behind. Consider wrapping in try/catch and deleting staging on failure.

Nit:
6. ExtractChapterNumber silently returning null for non-chapter files is a meaningful behavioral change from the previous InvalidOperationException — worth a code comment noting this is intentional and that callers handle null.

Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Fixed
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Fixed
BenjaminMichaelis and others added 3 commits May 6, 2026 20:52
- Rename candidates_list → candidatesList (C# camelCase convention)
- Make NpgsqlDataSource required in EmbeddingService constructor (always
  registered in DI; optional+throw was misleading anti-pattern)
- Add EmbeddingBatchSize = 2048 constant and batch the GenerateAsync call
  to respect Azure OpenAI input limit
- Validate collectionName against safe identifier regex before SQL use
- Add best-effort staging cleanup on UpsertAsync failure (nested try so
  cleanup exception cannot mask the original)
- Document ChapterNumber nullability on BookContentChunk property and
  ToBookContentChunks public method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ment

- Replace `catch { }` with `catch (Exception cleanupEx) when (cleanupEx is
  not OperationCanceledException)` + Console.Error.WriteLine so cleanup
  failures are visible without masking the original exception
- Correct method summary: swap uses two SQL RENAMEs (live→old, staging→live)
  plus DROP TABLE statements, not "three SQL RENAMEs"

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BenjaminMichaelis BenjaminMichaelis force-pushed the bmichaelis/EmbeddingUpdates branch from 078bc93 to d25e8b8 Compare May 7, 2026 03:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants