feat: Update embedding logic to bulk #1037
Conversation
Pull request overview
This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.
Changes:
- Introduces a `MarkdownChunk` model and updates chunking/output/tests to use `Heading` + `ChunkText`.
- Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
- Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| EssentialCSharp.Chat/Program.cs | Updates chunk stats/output to use MarkdownChunk fields. |
| EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs | Updates assertions for the new chunk model (ChunkText). |
| EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs | Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances. |
| EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs | Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly. |
| EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs | Implements batch embedding + staging-then-swap upload strategy using Npgsql. |
| EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs | Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex. |
| EssentialCSharp.Chat.Shared/Services/AISearchService.cs | Adds heading-based deduplication of vector search results. |
| EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs | Adds ChunkIndex as stored metadata for chunks. |
```csharp
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;
}
```
The embedding generator call batches all chunks into a single request, but Azure OpenAI embeddings have an input-count limit (noted in the comment as 2048). If the book produces more than that many chunks, this will fail at runtime. Consider chunking texts into batches (<= provider limit) and merging the returned vectors back into chunkList (or throw a clear exception when the limit is exceeded).
Suggested change:

```csharp
// ── Step 2: Batch-embed all chunks in provider-safe API calls ─────────────────
// Azure OpenAI embeddings impose an input-count limit per request.
// Process the texts in batches and merge the returned vectors back into the
// original chunk list to preserve ordering.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

const int maxEmbeddingBatchSize = 2048;
for (int batchStart = 0; batchStart < texts.Count; batchStart += maxEmbeddingBatchSize)
{
    int batchSize = Math.Min(maxEmbeddingBatchSize, texts.Count - batchStart);
    List<string> batchTexts = texts.GetRange(batchStart, batchSize);

    GeneratedEmbeddings<Embedding<float>> embeddings =
        await embeddingGenerator.GenerateAsync(batchTexts, cancellationToken: cancellationToken);

    if (embeddings.Count != batchSize)
        throw new InvalidOperationException(
            $"Embedding count mismatch for batch starting at index {batchStart}: expected {batchSize}, got {embeddings.Count}.");

    for (int i = 0; i < batchSize; i++)
    {
        chunkList[batchStart + i].TextEmbedding = embeddings[i].Vector;
    }
}
```
```csharp
// Drop any leftover backup from a previous run
cmd.CommandText = $"DROP TABLE IF EXISTS \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename live → old. IF EXISTS is a no-op on first run when no live table exists.
// Using ALTER TABLE IF EXISTS avoids PL/pgSQL string interpolation entirely.
cmd.CommandText = $"ALTER TABLE IF EXISTS \"{collectionName}\" RENAME TO \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename staging → live
cmd.CommandText = $"ALTER TABLE \"{stagingName}\" RENAME TO \"{collectionName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);
```
These SQL statements interpolate collectionName/derived names directly into identifier-quoted SQL. If collectionName can be influenced outside trusted code, this becomes identifier-injection (quotes can be escaped/broken). Consider restricting collectionName to a safe identifier regex (e.g., letters/digits/underscore) before composing SQL, and use Npgsql's identifier-quoting helper to build the final identifiers consistently.
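A sketch of what that guard could look like; the helper name and regex are illustrative, not code from this PR (Npgsql's `NpgsqlCommandBuilder.QuoteIdentifier` handles the double-quote escaping):

```csharp
using System;
using System.Text.RegularExpressions;
using Npgsql;

static class PgIdentifier
{
    // Hypothetical guard: accept only letters/digits/underscore, then quote
    // consistently so even a validated name cannot break out of the identifier.
    public static string QuoteSafe(string name)
    {
        if (!Regex.IsMatch(name, @"^[A-Za-z_][A-Za-z0-9_]*$"))
            throw new ArgumentException($"Unsafe identifier: '{name}'.", nameof(name));
        return new NpgsqlCommandBuilder().QuoteIdentifier(name);
    }
}

// Usage in the swap, replacing the raw interpolation:
// cmd.CommandText = $"DROP TABLE IF EXISTS {PgIdentifier.QuoteSafe(oldName)}";
```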
```csharp
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
///    PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
///    LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.
```
The method documentation says the swap uses "three SQL RENAMEs", but the implementation performs two RENAME operations (live→old, staging→live) plus DROP TABLE statements. Update the comments so they accurately describe the actual DDL being executed.
Suggested change:

```csharp
/// 4. Atomically swap tables in a single transaction using two SQL RENAME operations
///    (live → old, staging → live). PostgreSQL ALTER TABLE acquires
///    AccessExclusiveLock automatically; no explicit LOCK TABLE is needed. The
///    transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table with DROP TABLE.
```
```diff
 public async Task GenerateBookContentEmbeddingsAndUploadToVectorStore(
     IEnumerable<BookContentChunk> bookContents,
     CancellationToken cancellationToken,
     string? collectionName = null)
 {
     collectionName ??= CollectionName;
+    string stagingName = $"{collectionName}_staging";
+    string oldName = $"{collectionName}_old";

-    var collection = vectorStore.GetCollection<string, BookContentChunk>(collectionName);
-    await collection.EnsureCollectionDeletedAsync(cancellationToken);
-    await collection.EnsureCollectionExistsAsync(cancellationToken);
+    if (dataSource is null)
+        throw new InvalidOperationException("NpgsqlDataSource is required for the staging swap. Ensure it is registered in DI.");

-    int uploadedCount = 0;
+    // ── Step 1: Prepare staging collection ────────────────────────────────────────
+    var staging = vectorStore.GetCollection<string, BookContentChunk>(stagingName);
+    await staging.EnsureCollectionDeletedAsync(cancellationToken);
+    await staging.EnsureCollectionExistsAsync(cancellationToken);

-    foreach (var chunk in bookContents)
+    // ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
+    // IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
+    // The single-string overload used previously is a convenience extension method
+    // that wraps one item and calls this same method.
+    var chunkList = bookContents.ToList();
+    var texts = chunkList.Select(c => c.ChunkText).ToList();
+
+    GeneratedEmbeddings<Embedding<float>> embeddings =
+        await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);
+
+    if (embeddings.Count != chunkList.Count)
+        throw new InvalidOperationException(
+            $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");
+
+    for (int i = 0; i < chunkList.Count; i++)
+    {
+        chunkList[i].TextEmbedding = embeddings[i].Vector;
+    }
+
+    // ── Step 3: Batch-upsert all chunks into staging ──────────────────────────────
+    await staging.UpsertAsync(chunkList, cancellationToken);
+    Console.WriteLine($"Uploaded {chunkList.Count} chunks to staging collection '{stagingName}'.");
```
GenerateBookContentEmbeddingsAndUploadToVectorStore now contains significant new behavior (batch embedding + staging/swap workflow) but there are no unit/integration tests covering it in the Chat test suite. Adding tests that at least verify batching semantics (e.g., embedding generator called with N inputs and staging collection UpsertAsync invoked once with embeddings assigned) would help prevent regressions; the swap step can be isolated behind an interface to make it testable without a live database.
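For the batching half, something along these lines could work without a live database; `EmbeddingBatcher.EmbedInBatchesAsync` is a hypothetical helper (the batch loop extracted from the upload method), and the fake assumes the Microsoft.Extensions.AI 9.x interface shape:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Xunit;

// Fake generator that records how many inputs each GenerateAsync call receives.
sealed class CountingEmbeddingGenerator : IEmbeddingGenerator<string, Embedding<float>>
{
    public List<int> BatchSizes { get; } = [];

    public Task<GeneratedEmbeddings<Embedding<float>>> GenerateAsync(
        IEnumerable<string> values,
        EmbeddingGenerationOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var inputs = values.ToList();
        BatchSizes.Add(inputs.Count);
        var embeddings = inputs.Select(_ => new Embedding<float>(new[] { 0f }));
        return Task.FromResult(new GeneratedEmbeddings<Embedding<float>>(embeddings));
    }

    public object? GetService(Type serviceType, object? serviceKey = null) => null;
    public void Dispose() { }
}

public class EmbeddingBatchingTests
{
    [Fact]
    public async Task EmbedInBatches_SplitsAtProviderLimitAndPreservesCount()
    {
        var generator = new CountingEmbeddingGenerator();
        var texts = Enumerable.Range(0, 5000).Select(i => $"chunk {i}").ToList();

        // Hypothetical helper: the batch loop from the upload method, extracted
        // so it can be exercised without a vector store or NpgsqlDataSource.
        var vectors = await EmbeddingBatcher.EmbedInBatchesAsync(
            generator, texts, maxBatchSize: 2048, CancellationToken.None);

        // 5000 inputs at a limit of 2048 should yield batches of 2048 + 2048 + 904.
        Assert.Equal(new[] { 2048, 2048, 904 }, generator.BatchSizes);
        Assert.Equal(texts.Count, vectors.Count);
    }
}
```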
```csharp
var candidates_list = new List<VectorSearchResult<BookContentChunk>>();
await foreach (var result in collection.SearchAsync(searchVector, options: vectorSearchOptions, top: candidates, cancellationToken: cancellationToken))
{
    results.Add(result);
    candidates_list.Add(result);
}
```
Local variable name candidates_list uses snake_case, which is inconsistent with the surrounding camelCase naming in this method (e.g., vectorSearchOptions, searchVector). Rename to candidatesList (or similar) for consistency and readability.
BenjaminMichaelis left a comment:
Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:
Must fix:
1. `candidates_list` (AISearchService.cs) violates the C# camelCase convention; it should be `candidatesList`.
2. `NpgsqlDataSource?` is nullable/optional in the constructor, but `GenerateBookContentEmbeddingsAndUploadToVectorStore` throws immediately if it is null. This is a DI anti-pattern: either require it in the constructor or split into two classes. As-is, the service compiles and resolves but explodes only when the method is called (see the sketch after this list).
3. The Azure OpenAI batch limit (~2048 inputs) is mentioned in the summary comment but not enforced. A large book could silently exceed this and fail at runtime; consider batching internally.
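For item 2, a minimal sketch of the required-dependency shape (dependency names are taken from the diff; the real constructor likely has more parameters):

```csharp
// Requiring NpgsqlDataSource up front makes a missing registration fail at
// container build/resolution time instead of on the first upload call.
public sealed class EmbeddingService(
    IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator,
    VectorStore vectorStore,
    NpgsqlDataSource dataSource)
{
    // Methods can use dataSource directly, with no null check or throw path.
}
```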
Should fix:
4. SQL RENAME statements use raw string interpolation with collectionName, which is caller-controlled. Even though it currently comes from a constant, consider asserting/validating the name only contains safe characters (e.g., alphanumeric + underscore) to prevent accidental SQL issues.
5. If staging.UpsertAsync(chunkList, cancellationToken) throws, the staging table is left behind. Consider wrapping it in try/catch and deleting staging on failure (one possible shape is sketched below).
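One shape item 5 could take, using the filtered catch the follow-up commit below settles on (a sketch around the existing `staging`/`chunkList` locals):

```csharp
try
{
    await staging.UpsertAsync(chunkList, cancellationToken);
}
catch
{
    try
    {
        // Best-effort: don't leave a half-populated staging table behind.
        await staging.EnsureCollectionDeletedAsync(CancellationToken.None);
    }
    catch (Exception cleanupEx) when (cleanupEx is not OperationCanceledException)
    {
        // Log and continue so the cleanup failure can't mask the original error.
        Console.Error.WriteLine($"Staging cleanup failed: {cleanupEx.Message}");
    }
    throw;
}
```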
Nit:
6. ExtractChapterNumber silently returning null for non-chapter files is a meaningful behavioral change from the previous InvalidOperationException — worth a code comment noting this is intentional and that callers handle null.
- Rename candidates_list → candidatesList (C# camelCase convention)
- Make NpgsqlDataSource required in EmbeddingService constructor (always registered in DI; optional+throw was a misleading anti-pattern)
- Add EmbeddingBatchSize = 2048 constant and batch the GenerateAsync call to respect the Azure OpenAI input limit
- Validate collectionName against a safe identifier regex before SQL use
- Add best-effort staging cleanup on UpsertAsync failure (nested try so a cleanup exception cannot mask the original)
- Document ChapterNumber nullability on the BookContentChunk property and the ToBookContentChunks public method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ment
- Replace `catch { }` with `catch (Exception cleanupEx) when (cleanupEx is
not OperationCanceledException)` + Console.Error.WriteLine so cleanup
failures are visible without masking the original exception
- Correct method summary: swap uses two SQL RENAMEs (live→old, staging→live)
plus DROP TABLE statements, not "three SQL RENAMEs"
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 078bc93 to d25e8b8.